http://modelingsocialdata.org/
Mon, 15 Apr 2019 14:54:17 +0000
Homework 4
http://modelingsocialdata.org/homework/2019/04/10/homework-4.html
<p>The fourth homework assignment, <a href="https://github.com/jhofman/msd2019/tree/master/homework/homework_4">posted on Github</a>, is due on Thursday, April 25 by 11:59pm ET.</p>
<p>The first problem explores the small-world phenomenon in “close” vs. “distant” friend networks, the second studies how the structure of an email network changes as we remove weak ties from it, and the third looks at gender assortativity in networks. Details are in the README.md file for each problem.</p>
<p>Your code and results are to be submitted electronically in one zipped (or tarball-ed) file through the <a href="https://courseworks2.columbia.edu/courses/77738">CourseWorks</a> site. All solutions should be contained in the corresponding files provided here. Code should be written in bash / R and should not have complex dependencies on non-standard libraries. Each problem contains an Rmarkdown file which should be rendered as a pdf to be submitted with your solution. All work should be your own and done individually.</p>
Wed, 10 Apr 2019 10:00:00 +0000
Lecture 10: Networks
http://modelingsocialdata.org/lectures/2019/04/05/lecture-10-networks.html
<p>We used this lecture to first go through applications of logistic regression and then to discuss the history of network science.</p>
<!-- We spent this lecture discussing network data, including a whirlwind tour of the history of network theory, representations and characteristics of networks, and algorithms for analyzing network data. -->
<center>
<script async="" class="speakerdeck-embed" data-id="7848c1385ff346709bae389edb62613d" data-ratio="1.33333333333333" src="//speakerdeck.com/assets/embed.js"></script>
</center>
<p>We started off this lecture by revisiting logistic regression, looking at the problem of modeling which passengers <a href="https://www.kaggle.com/c/titanic">survived the Titanic disaster</a>. We saw that interpreting logistic regression results can be challenging, as coefficients give information about changes in log-odds (as opposed to probabilities directly). We stressed the idea of converting back to probabilities and visually comparing predicted and actual values for a range of feature values to better understand the model fit. See <a href="http://htmlpreview.github.io/?https://github.com/jhofman/msd2019/blob/master/lectures/lecture_10/interpreting_logistic_regression.html">this notebook</a> for details.</p>
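<p>The point about log-odds can be made concrete with a small sketch. The following Python snippet (the course itself uses R; the coefficients below are made up for illustration, not taken from the Titanic fit) shows how a coefficient shifts the log-odds, and why the implied change in probability depends on the baseline:</p>

```python
import math

def sigmoid(z):
    # logistic function: maps log-odds to a probability
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical fitted model: log-odds = b0 + b1 * (1 if female else 0)
b0, b1 = -1.1, 2.5

p_male = sigmoid(b0)         # predicted survival probability, male
p_female = sigmoid(b0 + b1)  # predicted survival probability, female

# b1 is the *change in log-odds*, not the change in probability:
# the same b1 implies a different probability shift at a different b0.
print(round(p_male, 3), round(p_female, 3))
```

Converting back to the probability scale like this, and plotting predicted against actual values, is usually far easier to reason about than the raw coefficients.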
<p>Next we discussed <a href="https://github.com/VowpalWabbit/vowpal_wabbit/wiki">Vowpal Wabbit</a> (VW), an open source tool for various machine learning tasks. VW has many attractive features, such as a flexible input format, speed, scalability, and sensible defaults. For binary classification, VW defaults to fitting a (clipped) linear model to minimize squared loss. We looked at an example of <a href="https://github.com/VowpalWabbit/vowpal_wabbit/wiki/Rcv1-example">classifying news</a> with VW to get a sense of the interface and performance, which is quite competitive.</p>
<p>Then we moved on to a history of network science.</p>
<p>We talked about some of the earliest studies of networks, such as Jacob Moreno’s <a href="https://timesmachine.nytimes.com/timesmachine/1933/04/03/99218765.html?action=click&contentCollection=Archives&module=LedeAsset&region=ArchiveBody&pgtype=article&pageNumber=17">sociograms</a> and Mark Granovetter’s work on the <a href="https://sociology.stanford.edu/sites/default/files/publications/the_strength_of_weak_ties_and_exch_w-gans.pdf">strength of weak ties</a>. We contrasted theoretical models of graphs (e.g., <a href="http://en.wikipedia.org/wiki/Erdős–Rényi_model">Erdős–Rényi</a> random graphs) to real-world networks, which tend to have highly <a href="http://en.wikipedia.org/wiki/Complex_network#Scale-free_networks">skewed degree distributions</a> as originally discussed in Derek de Solla Price’s studies of <a href="http://garfield.library.upenn.edu/papers/pricenetworks1965.pdf">citation networks</a>. At the same time, social networks typically have <a href="http://en.wikipedia.org/wiki/Small-world_network">short path lengths</a>, in the sense that one needs only to traverse a handful of links to connect a randomly selected set of people in the network.</p>
<p>We finished by discussing different types of networks that we might analyze as well as the various levels of abstraction available for representing them.</p>
<p>More on networks next time.</p>
<!--
, we turned to algorithms for efficiently computing shortest path lengths, connected components, mutual friends, and clustering coefficients.
We started with the problem of finding the shortest distance between a single source node and all other nodes in a (undirected, unweighted) network, as measured by the fewest number of edges you need to traverse to get from the source to every other node.
(Every researcher's favorite version of this is computing their [Erdős number](http://en.wikipedia.org/wiki/Erdős_number), the academic take on the more well-known [Kevin Bacon game](http://en.wikipedia.org/wiki/Six_Degrees_of_Kevin_Bacon). Compute yours [here](http://academic.research.microsoft.com).)
Breadth first search (BFS) provides a nice solution.
The intuition behind BFS is simple: we start from the source node and mark it as distance zero from itself.
Then we visit each of its neighbors and mark those as distance one.
We repeat this iteratively, pushing forward a boundary of recently discovered nodes that are one additional hop from the source at each step.
BFS visits each node and edge in a network once, scaling linearly in the size of the network.
If, however, we would like to find the shortest distance between _all pairs_ of nodes then we must repeat this for each possible source node, and so this quickly becomes prohibitively expensive for even moderately sized networks.
(See [here](http://en.wikipedia.org/wiki/Shortest_path_problem#All-pairs_shortest_paths) for fancier, more efficient algorithms.)
Next we looked at using BFS for a related problem: finding the number of [connected components](http://en.wikipedia.org/wiki/Connected_component_(graph_theory)), or separate pieces, of a network.
We did this by simply looping over our shortest path code, seeding it on each iteration with a currently unreachable node as the source until we reach all nodes.
We gave the reachable nodes in each BFS a unique label corresponding to its component.
Then we moved on to computing the number of friends that any two nodes have in common, motivated by the problem of friend recommendations on social networks.
The underlying idea can be traced back to [Granovetter](https://sociology.stanford.edu/sites/default/files/publications/the_strength_of_weak_ties_and_exch_w-gans.pdf): two people are likely to know each other if they have many mutual friends.
To compute the number of mutual friends between all pairs of nodes, we exploit the fact that the neighbors of every node share that node as a common friend.
To count all mutual friends we simply loop over each node and increment a counter for every pair of its neighbors.
For each node this scales as the square of its degree, so the whole algorithm scales as the sum of the squared degrees of all nodes.
This can quickly become expensive if we have even a few high-degree nodes, which are quite common in practice.
Finally, we looked at the closely related problem of counting the number of triangles around each node in a network.
This algorithm is nearly identical to computing mutual friends, as we generate the same set of two-hop paths through all pairs of a node's neighbors, but simply increment different counters to generate different results.
Instead of accumulating mutual friends for each pair of a node's neighbors, we ask whether every pair of neighbors are themselves directly connected.
If so, we count this as (half of) a triangle in which the node participates.
Dividing the number of closed triangles in a network by the number of possible triangles that could be present gives a useful measure of how [clustered](http://en.wikipedia.org/wiki/Clustering_coefficient) a network is.
To better understand properties of networks and how to compute them, we looked at a few example networks in R using the ``igraph`` package.
See the [notebooks](https://github.com/jhofman/msd2017/tree/master/lectures/lecture_10) on the course GitHub page for related code and data used in the lectures.
-->
<p>References:</p>
<ul>
<li>The <a href="https://github.com/VowpalWabbit/vowpal_wabbit/wiki">Vowpal Wabbit Wiki</a></li>
<li>Chapters 2, 18, and 20 of Easley and Kleinberg’s <a href="http://www.cs.cornell.edu/home/kleinber/networks-book/">Networks, Crowds, and Markets</a></li>
<li>Granovetter’s <a href="https://sociology.stanford.edu/sites/default/files/publications/the_strength_of_weak_ties_and_exch_w-gans.pdf">Strength of Weak Ties</a> paper</li>
<li>de Solla Price on <a href="http://garfield.library.upenn.edu/papers/pricenetworks1965.pdf">citation networks</a> and <a href="http://garfield.library.upenn.edu/price/pricetheory1976.pdf">cumulative advantage</a></li>
<li>Milgram’s original <a href="https://en.wikipedia.org/wiki/Small-world_experiment">small world experiment</a></li>
<li><a href="https://www.math.cornell.edu/m/sites/default/files/imported/People/strogatz/nature_smallworld.pdf">Collective dynamics of ‘small-world’ networks</a> by Watts & Strogatz</li>
</ul>
<!--
* [Four degrees of separation](http://web.stanford.edu/~jugander/papers/websci12-fourdegrees.pdf): scaling up calculations to the entire Facebook social graph
* [Customizable route planning](http://www.rebennack.net/SEA2011/files/talks/SEA2011_Pajor.pdf): how shortest path calculations are done in modern mapping applications
* These [slides](https://berkeleydatascience.files.wordpress.com/2012/03/20120320berkeley.pdf) on the early system for friend recommendation on Facebook (pages 28 to 37)
-->
<!--
BFS computes shortest path: http://www.cs.toronto.edu/~krueger/cscB63h/lectures/BFS.pdf
BFS runtime and correctness: http://www.cse.ust.hk/faculty/golin/COMP271Sp03/Notes/MyL06.ps
[MapReduce for networks](http://jakehofman.com/icwsm2010/slides.html)
https://github.com/jhofman/icwsm2010_tutorial
[Curse of the last reducer](http://theory.stanford.edu/~sergei/papers/www11-triangles.pdf)
[Model of MapReduce](http://theory.stanford.edu/~sergei/papers/soda10-mrc.pdf)
[Facebook at scale](http://arxiv.org/abs/1111.4503)
-->
Fri, 05 Apr 2019 10:00:00 +0000
Lecture 9: Classification
http://modelingsocialdata.org/lectures/2019/03/29/lecture-9-classification.html
<p>In this lecture we covered classification with linear models, specifically naive Bayes and logistic regression.</p>
<center>
<script async="" class="speakerdeck-embed" data-id="46903fe715de4ab59c254c6a61ea866d" data-ratio="1.33333333333333" src="//speakerdeck.com/assets/embed.js"></script>
</center>
<p>We started this lecture by introducing the problem of classification and how it differs from regression: the outcome is categorical (e.g., whether an email is spam or <a href="https://wiki.apache.org/spamassassin/Ham">ham</a>) rather than continuous.
We first reviewed <a href="http://en.wikipedia.org/wiki/Bayes'_rule">Bayes’ rule</a> for inverting conditional probabilities via a simple, but <a href="http://bit.ly/ggbbc">somewhat counterintuitive</a>, <a href="http://www.scientificamerican.com/article/what-is-bayess-theorem-an/">medical diagnosis example</a> and then adapted this to an (extremely naive) <a href="https://github.com/jhofman/msd2019/blob/master/lectures/lecture_9/enron_naive_bayes.sh">one-word spam classifier</a>.
We improved upon this by considering all words present in a document and arrived at naive Bayes—a simple linear method for classification in which we model each word occurrence independently and use Bayes’ rule to calculate the probability the document belongs to each class given the words it contains.</p>
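<p>To make the mechanics concrete, here is a minimal Python sketch of a naive Bayes classifier (the course works in bash/R, and the tiny corpus below is invented for illustration; it also quietly adds a pseudocount, discussed next, to avoid zero probabilities):</p>

```python
import math
from collections import Counter

# Tiny hypothetical corpus: (words, label) pairs
docs = [
    ("send cash now".split(), "spam"),
    ("cheap cash offer".split(), "spam"),
    ("meeting agenda attached".split(), "ham"),
    ("lunch meeting tomorrow".split(), "ham"),
]

# Count word occurrences per class and class frequencies
word_counts = {"spam": Counter(), "ham": Counter()}
class_counts = Counter()
for words, label in docs:
    class_counts[label] += 1
    word_counts[label].update(words)

def log_posterior(words, label):
    # log P(label) + sum of log P(word | label), treating each
    # word independently -- the "naive" assumption
    logp = math.log(class_counts[label] / sum(class_counts.values()))
    total = sum(word_counts[label].values())
    for w in words:
        logp += math.log((word_counts[label][w] + 1) / (total + 2))  # +1 pseudocount
    return logp

def classify(text):
    words = text.split()
    return max(("spam", "ham"), key=lambda c: log_posterior(words, c))

print(classify("cash offer"))  # classified as spam
```

Working in log space avoids numerical underflow when multiplying many small word probabilities.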
<p>Although naive Bayes makes an obviously incorrect assumption that all features are independent, it turns out to be a reasonably useful method in practice.
It’s simple and scalable to train, easy to update as new data arrive, easy to interpret, and often more competitive in performance than one might expect.
That said, there are some obvious issues with naive Bayes as presented, namely overfitting in the training process and overconfidence / miscalibration when making predictions.</p>
<p>The first issue arises when thinking about how to estimate word probabilities.
Simple maximum likelihood estimates (MLE) for word probabilities lead to overfitting, implying, for instance, that it’s impossible to see a word in a given class in the future if we’ve never seen it occur in that class in the past.
We dealt with this by thinking about maximum a posteriori (MAP) estimation which led to the idea of <a href="https://en.wikipedia.org/wiki/Additive_smoothing">Laplace smoothing</a>, or adding <a href="http://en.wikipedia.org/wiki/Pseudocount">pseudocounts</a> to empirical word counts to prevent overfitting.
As usual, determining the amount of smoothing to use is an empirical question, often solved by methods such as cross-validation.</p>
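<p>The zero-count problem and its fix are easy to see numerically. A sketch with invented counts (not from any real corpus):</p>

```python
# MLE vs. Laplace-smoothed estimates of P(word | class), for a
# hypothetical class where "refund" was seen 3 times out of 10 word
# occurrences in training and "viagra" was never seen.
counts = {"refund": 3, "viagra": 0}
total = 10
vocab_size = 2  # number of modeled words (tiny, for illustration)

mle = {w: c / total for w, c in counts.items()}
alpha = 1  # pseudocount; in practice tuned, e.g. by cross-validation
smoothed = {w: (c + alpha) / (total + alpha * vocab_size)
            for w, c in counts.items()}

print(mle["viagra"])       # 0.0 -- MLE says this word can never occur
print(smoothed["viagra"])  # small but nonzero
```

A single unseen word under the MLE would zero out the entire class posterior, which is exactly the overfitting described above.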
<p>As for the second problem of feature independence, we addressed this by abandoning naive Bayes in favor of logistic regression.
Logistic regression makes predictions using the same functional form as naive Bayes—the log-odds are modeled as a weighted combination of feature values—but fits these weights in a manner that accounts for correlations between features.
We (once again) applied the maximum likelihood principle to arrive at criteria for estimating these weights, and discussed gradient descent for a solution. The resulting algorithms are very close in spirit to those for linear regression, but slightly more complex due to the logistic function.
And, similar to linear regression, we discussed the idea of regularizing logistic regression by including a term in the loss function to penalize large weight vectors.</p>
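<p>The whole fitting loop fits in a few lines. A Python sketch of regularized logistic regression by gradient descent on synthetic one-feature data (the data, learning rate, and penalty strength are all made up for illustration):</p>

```python
import math, random

# Toy data: true model is P(y=1|x) = sigmoid(2x)
random.seed(0)
xs = [random.uniform(-3, 3) for _ in range(200)]
ys = [1 if random.random() < 1 / (1 + math.exp(-2 * x)) else 0 for x in xs]

w, lam, lr = 0.0, 0.01, 0.1
for _ in range(500):
    # gradient of the average negative log-likelihood...
    grad = sum((1 / (1 + math.exp(-w * x)) - y) * x
               for x, y in zip(xs, ys)) / len(xs)
    grad += lam * w  # ...plus the gradient of the L2 penalty
    w -= lr * grad

print(round(w, 2))  # should land near the true coefficient of 2
```

Unlike least squares, there is no closed-form solution here, which is why iterative methods like gradient descent are the standard approach.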
<p>We concluded with a discussion of several metrics for evaluating classifiers, including calibration, <a href="https://en.wikipedia.org/wiki/Confusion_matrix">confusion matrices</a>, accuracy, precision and recall, and the <a href="https://en.wikipedia.org/wiki/Receiver_operating_characteristic">ROC curve</a>. See the <a href="https://github.com/jhofman/msd2019/blob/master/lectures/lecture_9/classification.ipynb">classification notebook</a> up on Github for more details.</p>
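<p>The confusion-matrix metrics reduce to simple ratios. A sketch with made-up counts:</p>

```python
# Counts from a hypothetical confusion matrix:
#                 predicted +   predicted -
#   actual +         tp            fn
#   actual -         fp            tn
tp, fp, fn, tn = 40, 10, 20, 30

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)  # of those predicted positive, how many are?
recall = tp / (tp + fn)     # of the actual positives, how many were found?

print(accuracy, precision, recall)
```

Accuracy alone can be misleading under class imbalance, which is why precision and recall (and the full ROC curve) are worth examining.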
<p>A few references:</p>
<ul>
<li>Chapter 12 of <a href="http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/">Advanced Data Analysis from an Elementary Point of View</a></li>
<li>Chapter 4 of <a href="http://www-bcf.usc.edu/~gareth/ISL/getbook.html">An Introduction to Statistical Learning</a></li>
<li><a href="http://www.cs.iastate.edu/~honavar/bayes-lewis.pdf">Naive Bayes at 40</a> by Lewis (1998)</li>
<li><a href="http://www.jstor.org/pss/1403452">Idiots Bayes—Not So Stupid After All?</a> by Hand and Yu (2001)</li>
<li><a href="http://robotics.stanford.edu/users/sahami/papers-dir/spam.pdf">A Bayesian Approach to Filtering Junk E-mail</a> from Sahami, Dumais, Heckerman, and Horvitz (1998)</li>
<li><a href="http://www.paulgraham.com/spam.html">A Plan for Spam</a> by Paul Graham (2002)</li>
<li><a href="https://ccrma.stanford.edu/workshops/mir2009/references/ROCintro.pdf">An introduction to ROC analysis</a></li>
<li><a href="http://www.navan.name/roc/">Understanding ROC curves</a></li>
<li><a href="https://github.com/VowpalWabbit/vowpal_wabbit/wiki">Vowpal Wabbit</a> for scalable classification</li>
</ul>
Fri, 29 Mar 2019 10:00:00 +0000
Homework 3
http://modelingsocialdata.org/homework/2019/03/29/homework-3.html
<p>The third homework assignment, <a href="https://github.com/jhofman/msd2019/tree/master/homework/homework_3">posted on Github</a>, is due on Thursday, April 11 by 11:59pm ET.</p>
<p>The first problem explores various modeling scenarios, the second looks at cross-validation for polynomial regression, and in the third you’ll use regularized logistic regression to classify New York Times articles. Details are in the README.md file for each problem.</p>
<p>Your code and results are to be submitted electronically in one zipped (or tarball-ed) file through the <a href="https://courseworks2.columbia.edu/courses/77738">CourseWorks</a> site. All solutions should be contained in the corresponding files provided here. Code should be written in bash / R and should not have complex dependencies on non-standard libraries. Each problem contains an Rmarkdown file which should be rendered as a pdf to be submitted with your solution. All work should be your own and done individually.</p>
Fri, 29 Mar 2019 10:00:00 +0000
Lecture 8: Regression, Part 2
http://modelingsocialdata.org/lectures/2019/03/15/lecture-8-regression-2.html
<p>This was the second lecture on the theory and practice of regression, focused on model complexity and generalization.</p>
<center>
<script async="" class="speakerdeck-embed" data-id="8b16d5652bae434e8d478f70bcce6724" data-ratio="1.33333333333333" src="//speakerdeck.com/assets/embed.js"></script>
</center>
<p>We started with an applied modeling problem: understanding how internet browsing activity varies by age and gender. We saw that there’s a lot more to modeling than just optimization, with many important steps along the way that range from collecting and specifying outcomes and predictors, to determining the form of a model, to assessing performance and interpreting results. We found that including quadratic terms for age and interacting gender with age gave a reasonable model, at least in terms of visually matching empirical aggregates. See the <a href="https://github.com/jhofman/msd2019/blob/master/lectures/lecture_7/linear_models.ipynb">linear models</a> notebook up on Github for more details.</p>
<p>Then we talked about two high-level points.
First, quantifying model fit and second, knowing when to stop fitting.
In this case, that translates to asking “how good is a quadratic fit” and “why shouldn’t I use a cubic, or quartic, etc.?” or “should I add another interaction?”</p>
<p>To the first point, we discussed <a href="https://en.wikipedia.org/wiki/Root-mean-square_deviation">root mean squared error (RMSE)</a> and the <a href="https://en.wikipedia.org/wiki/Coefficient_of_determination">coefficient of determination (<script type="math/tex">R^2</script>)</a> as sensible metrics of model fit.
RMSE is just the squared loss function we discussed last time, with a square root to adjust units to match those of the outcome we’re trying to predict.
It’s useful when we already have a sense of absolute scale for “what’s good”.
The coefficient of determination, on the other hand, captures the fraction of variance in outcomes explained by the model, and is useful when we don’t have such a scale or are comparing across different problems.
We showed that this is the same as comparing the mean squared error (MSE) of the model to the MSE of a simple baseline where we always predict the average outcome.
Finally, we discussed the connection between <a href="https://en.wikipedia.org/wiki/Pearson_correlation_coefficient">Pearson’s correlation coefficient</a> and <script type="math/tex">R^2</script>.
See <a href="https://economictheoryblog.com/2014/11/05/proof/">here</a> for a proof that the latter is in fact the square of the former. See the <a href="https://github.com/jhofman/msd2019/blob/master/lectures/lecture_8/model_evaluation.ipynb">model evaluation</a> notebook on Github for details.</p>
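<p>The MSE-versus-baseline view of <script type="math/tex">R^2</script> can be computed directly. A Python sketch with invented outcomes and predictions:</p>

```python
# R^2 as a comparison of the model's MSE to that of a baseline
# that always predicts the mean outcome.
ys = [1.0, 2.0, 3.0, 4.0, 5.0]
preds = [1.2, 1.9, 3.1, 3.8, 5.0]  # hypothetical model predictions

mean_y = sum(ys) / len(ys)
mse_model = sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)
mse_baseline = sum((y - mean_y) ** 2 for y in ys) / len(ys)

r_squared = 1 - mse_model / mse_baseline
rmse = mse_model ** 0.5  # same units as the outcome
print(round(r_squared, 3))
```

An <script type="math/tex">R^2</script> of 1 means the model explains all the variance; 0 means it does no better than always predicting the mean.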
<p>Applying both of these metrics to the pageview dataset, we saw that while there were systematic trends in typical viewing behavior by age and gender, there was still a surprisingly large amount of variation in individual activity for people of the same age and gender.</p>
<p>This led us to our second high-level topic, the question of complexity control: How complicated should we make our model?
We discussed the idea of generalization error, and how we’d like models that are both complex enough to account for the past and simple enough to predict the future.
Cross-validation is the most common approach to navigating this tradeoff, where we divide our data into a training set for fitting models, a validation set for comparing these different fits, and a test set that’s used once (and <em>only once</em>) to quote the expected future performance of the model we end up selecting.
We talked about <a href="https://www.youtube.com/watch?v=TIgfjmp-4BA">k-fold cross-validation</a> as a more statistically robust version of estimating generalization error. See the <a href="https://github.com/jhofman/msd2019/blob/master/lectures/lecture_8/complexity_Control.ipynb">complexity control</a> notebook on Github for details.</p>
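<p>The bookkeeping behind k-fold cross-validation is straightforward. A bare-bones Python sketch (the splitting rule here is a simple round-robin for illustration; in practice one would shuffle first):</p>

```python
# Each point appears in the validation fold exactly once; averaging
# the k validation scores estimates generalization error.
def kfold_indices(n, k):
    folds = []
    for i in range(k):
        val = [j for j in range(n) if j % k == i]
        train = [j for j in range(n) if j % k != i]
        folds.append((train, val))
    return folds

for train, val in kfold_indices(10, 5):
    pass  # fit on `train`, score on `val`, then average the k scores

print(len(kfold_indices(10, 5)))  # 5 folds
```

The test set, by contrast, stays untouched until the very end.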
<p>We also phrased this issue in terms of the <a href="http://scott.fortmann-roe.com/docs/BiasVariance.html">bias-variance tradeoff</a>.
Simple models are likely biased in that they systematically misrepresent the world, and would do so even with an infinite amount of data.
At the same time, estimating a simple model is a low variance procedure in that our results don’t change substantially when we fit it on different samples of data.
More flexible models, on the other hand, have little bias and can capture more complex patterns in the world.
The downside is that this flexibility also renders such models sensitive to noise, often leading to high variance, or drastically different results with different samples of the data.</p>
<p>We concluded lecture with a brief discussion of <a href="https://en.wikipedia.org/wiki/Regularization_%28mathematics%29">regularization</a> as a way of modifying loss functions to improve the generalization error of our models by explicitly balancing the fit to the training data with the “complexity” of the model.
The idea is that introducing some bias in our models is sometimes a good idea if the corresponding reduction in variance is enough to lower the mean squared error.</p>
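<p>In its simplest (ridge-style) form, the modified loss is just squared error plus a penalty on the size of the weights. A Python sketch with made-up data:</p>

```python
# Penalized loss: mean squared error plus lambda * w^2.
# Larger lambda shrinks the coefficient toward zero, trading a
# little bias for (hopefully) a bigger reduction in variance.
def ridge_loss(w, xs, ys, lam):
    sq_err = sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)
    return sq_err + lam * w ** 2

xs, ys = [1.0, 2.0, 3.0], [2.1, 3.9, 6.2]
# The penalty raises the loss at the unpenalized optimum (w near 2),
# so the penalized minimizer sits closer to zero.
print(ridge_loss(2.0, xs, ys, 0.0) < ridge_loss(2.0, xs, ys, 1.0))
```

Replacing <code>w ** 2</code> with <code>abs(w)</code> gives the lasso penalty, which additionally drives some coefficients exactly to zero.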
<p>See Github for an <a href="https://github.com/jhofman/msd2019/blob/master/lectures/lecture_8/intro_to_glmnet.ipynb">introduction to glmnet</a> as well as this interactive Shiny App to explore <a href="https://jmhmsr.shinyapps.io/regularization/">regularization</a>.</p>
<p>References:</p>
<ul>
<li>Chapter 2 of <a href="http://www-bcf.usc.edu/~gareth/ISL/">An Introduction to Statistical Learning</a> on the bias-variance tradeoff</li>
<li>Section 1.4 of <a href="http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/">Advanced Data Analysis from an Elementary Point of View</a> on the same, with a more detailed derivation
<!-- http://www.inf.ed.ac.uk/teaching/courses/mlsc/Notes/Lecture4/BiasVariance.pdf --></li>
<li>Chapter 5 of <a href="http://www-bcf.usc.edu/~gareth/ISL/">An Introduction to Statistical Learning</a> and 3 of <a href="http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/">Advanced Data Analysis from an Elementary Point of View</a> on resampling and cross-validation</li>
<li>Recent work on using differentially private mechanisms for <a href="https://research.googleblog.com/2015/08/the-reusable-holdout-preserving.html">reusing holdout sets</a></li>
<li>Chapters <a href="http://r4ds.had.co.nz/model-basics.html">23</a> and <a href="http://r4ds.had.co.nz/model-building.html">24</a> of <a href="http://r4ds.had.co.nz">R for Data Science</a></li>
<li>The <a href="https://modelr.tidyverse.org">modelr</a> and <a href="https://github.com/tidymodels/tidymodels">tidymodels</a> packages in R</li>
<li>The <a href="https://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html">glmnet vignette</a></li>
</ul>
Fri, 15 Mar 2019 10:00:00 +0000
Lecture 7: Regression, Part 1
http://modelingsocialdata.org/lectures/2019/03/08/lecture-7-regression-1.html
<p>This was the first of two lectures on the theory and practice of regression.</p>
<center>
<script async="" class="speakerdeck-embed" data-id="199594cffb524787a7bced446593789a" data-ratio="1.33333333333333" src="//speakerdeck.com/assets/embed.js"></script>
</center>
<p>In the first part of class we shifted from talking about problems in how science is often done to best practices for doing good science. We went through the pipeline of designing a study, piloting and revising it, doing a power calculation, pre-registering the study, running it, creating a reproducible analysis and report, and thinking critically about the results.</p>
<p>Next we moved on to regression.
We started with a high-level overview of regression, which can be broadly defined as any analysis of how one continuous variable (the “outcome”) changes with others (the “inputs”, “predictors”, or “features”).
The goals of a regression analysis can vary, from describing the data at hand, to predicting new outcomes, to explaining the associations between outcomes and predictors.
This includes everything from looking at histograms and scatter plots to building statistical models.</p>
<p>We focused on the latter and discussed ordinary least squares regression.
First, we motivated this as an optimization problem and then connected squared loss minimization to the more general principle of maximum likelihood.
Then we discussed several ways to solve this optimization problem to estimate coefficients for a linear model, which are summarized in the table below.</p>
<table>
<thead>
<tr>
<th>Method</th>
<th style="text-align: center">Space</th>
<th style="text-align: center">Time</th>
<th>Comments</th>
</tr>
</thead>
<tbody>
<tr>
<td>Invert normal equations</td>
<td style="text-align: center"><script type="math/tex">N K + K^2</script></td>
<td style="text-align: center"><script type="math/tex">K^3</script></td>
<td>Good for medium-sized datasets with a relatively small number (e.g., hundreds or thousands) of features</td>
</tr>
<tr>
<td>Gradient descent</td>
<td style="text-align: center"><script type="math/tex">N K</script></td>
<td style="text-align: center"><script type="math/tex">NK</script> per step</td>
<td>Good for larger datasets that still fit in memory but have more (e.g., millions) features; requires tuning learning rate</td>
</tr>
<tr>
<td>Stochastic gradient descent</td>
<td style="text-align: center"><script type="math/tex">K</script></td>
<td style="text-align: center"><script type="math/tex">K</script> per step</td>
<td>Good for datasets that exceed available memory; more sensitive to learning rate schedule</td>
</tr>
</tbody>
</table>
<p>See also this interactive Shiny App to explore <a href="https://jmhmsr.shinyapps.io/modelfit/">manually fitting a simple model</a> and this notebook by Jongbin Jung with <a href="http://jakehofman.com/gd/">an animation of gradient descent</a>.</p>
<!--
In the second half of class we looked at fitting linear models in R, with an application to understanding how internet browsing activity varies by age and gender.
See the [Jupyter notebook](https://github.com/jhofman/msd2017/blob/master/lectures/lecture_6/linear_models.ipynb) up on Github for more details.
The main lesson here is that there's more to modeling than just optimization, with many important steps along the way that range from collecting and specifying outcomes and predictors, to determining the form of a model, to assessing performance and interpreting results.
-->
<p>References:</p>
<ul>
<li>Chapter 3 of <a href="http://www-bcf.usc.edu/~gareth/ISL/">An Introduction to Statistical Learning</a></li>
<li>Chapters 1 and 2 of <a href="http://www.stat.cmu.edu/%7Ecshalizi/ADAfaEPoV/">Advanced Data Analysis from an Elementary Point of View</a></li>
<li>Chapter 5 of OpenIntro’s <a href="https://www.openintro.org/stat/textbook.php">Introductory Statistics with Randomization and Simulation</a></li>
<li><a href="http://www.cambridge.org/us/academic/subjects/statistics-probability/statistical-theory-and-methods/statistical-models-theory-and-practice-2nd-edition?format=PB">Statistical Models</a> by David Freedman</li>
<li><a href="https://us.sagepub.com/en-us/nam/regression-analysis/book226138">Regression Analysis</a> by Richard Berk</li>
<li>Chapters <a href="http://r4ds.had.co.nz/model-basics.html">23</a> and <a href="http://r4ds.had.co.nz/model-building.html">24</a> of <a href="http://r4ds.had.co.nz">R for Data Science</a></li>
</ul>
Fri, 08 Mar 2019 10:00:00 +0000
Lecture 6: Reproducibility and replication, Part 2
http://modelingsocialdata.org/lectures/2019/03/01/lecture-6-reproducibility-2.html
<p>This was our second lecture on reproducibility and replication in which we discussed false discoveries, effect sizes, and p-hacking / researcher degrees of freedom.</p>
<center>
<script async="" class="speakerdeck-embed" data-id="ce73cc7b18114447b75619411419bd76" data-ratio="1.33333333333333" src="//speakerdeck.com/assets/embed.js"></script>
</center>
<p>The <a href="/lectures/2019/02/22/lecture-5-reproducibility-1.html">previous lecture</a> provided a high-level overview of the ongoing replication crisis in the sciences. In this lecture we continued the discussion, first by talking about false discoveries. Following Felix Schönbrodt’s excellent <a href="http://www.nicebread.de/whats-the-probability-that-a-significant-p-value-indicates-a-true-effect/">blog post</a>, we talked about how underpowered studies lead to false discoveries. Then we went on to discuss <a href="https://transparentstats.github.io/guidelines/effectsize.html">effect sizes</a>, specifically <a href="https://en.wikipedia.org/wiki/Effect_size#Cohen's_d">Cohen’s d</a> and the <a href="https://en.wikipedia.org/wiki/Effect_size#Common_language_effect_size">AUC</a>, through this excellent <a href="https://rpsychologist.com/d3/cohend/">visual tool</a>.</p>
<p>Next we spoke about <a href="https://en.wikipedia.org/wiki/Post_hoc_analysis">post-hoc data analysis</a> and <a href="https://en.wikipedia.org/wiki/Data_dredging">p-hacking</a>. We looked at the <a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=1850704">False-Positive Psychology</a> paper by Simmons, Nelson & Simonsohn, which has an illustrative example of how one can arrive at nonsensical conclusions if there’s enough flexibility in data collection and analysis. Gelman and Loken’s <a href="http://www.stat.columbia.edu/~gelman/research/unpublished/p_hacking.pdf">The Garden of Forking Paths</a> makes a similar point, noting that this can often occur without ill intent on the part of the researcher. While these issues are complex, there are a few best practices (e.g., running pilot studies followed by <a href="https://aspredicted.org">pre-registration</a> of high-powered, large-scale experiments) that can help mitigate these concerns.
<a href="http://www.sciencemag.org/careers/2015/12/register-your-study-new-publication-option">Registered reports</a> are a particularly attractive solution, wherein researchers write up and submit an experimental study for peer review <em>before</em> the study is conducted. Reviewers make an acceptance decision at this point based on the merit of the study, and, if accepted, it is published regardless of the results. We also discussed how these ideas that come largely from randomized experiments might be adapted for observational studies.</p>
<p>We finished up class by talking about a few tools for computational reproducibility, specifically <a href="https://rmarkdown.rstudio.com">RMarkdown</a> for reproducible documents and <a href="https://bost.ocks.org/mike/make/">Makefiles</a> for efficient workflows. Example files are up
<a href="https://github.com/jhofman/msd2019/tree/master/lectures/lecture_6">on Github</a>.</p>
<p>References:</p>
<ul>
<li>A guide on <a href="https://transparentstats.github.io/guidelines/effectsize.html">effect sizes</a> and related <a href="https://transparentstatistics.org/2018/07/05/meanings-effect-size/">blog post</a></li>
<li><a href="https://rpsychologist.com/d3/cohend/">Interpreting Cohen’s d effect size</a></li>
<li><a href="https://journals.sagepub.com/doi/pdf/10.1177/0956797613504966">The New Statistics: Why and How</a> by Cumming</li>
<li><a href="https://www.jstor.org/stable/3802789?seq=1#metadata_info_tab_contents">The Insignificance of Significance Testing</a> by Johnson</li>
<li><a href="https://journals.sagepub.com/doi/abs/10.1177/106591299905200309">The Insignificance of Null Hypothesis Significance Testing</a> by Gill</li>
<li><a href="http://journals.plos.org/plosmedicine/article/file?id=10.1371/journal.pmed.0020124&type=printable">Why Most Published Research Findings Are False</a></li>
<li>Felix Schönbrodt’s <a href="http://www.nicebread.de/whats-the-probability-that-a-significant-p-value-indicates-a-true-effect/">blog post</a> and
<a href="http://shinyapps.org/apps/PPV/">shiny app</a> on misconceptions about p-values and false discoveries</li>
<li><a href="http://www.cyclismo.org/tutorial/R/power.html">Calculating the power of a test</a></li>
<li><a href="http://www.nature.com/nrn/journal/v14/n5/pdf/nrn3475.pdf">Power failure: why small sample size undermines the reliability of neuroscience</a> by Button et al.</li>
<li><a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=1850704">False-Positive Psychology</a> by Simmons, Nelson & Simonsohn</li>
<li><a href="http://www.stat.columbia.edu/~gelman/research/unpublished/p_hacking.pdf">The garden of forking paths</a> by Gelman & Loken</li>
<li><a href="https://www.cambridge.org/core/journals/psychological-medicine/article/cumulative-effect-of-reporting-and-citation-biases-on-the-apparent-efficacy-of-treatments-the-case-of-depression/71D73CADE32C0D3D996DABEA3FCDBF57/core-reader">The cumulative effect of reporting and citation biases on the apparent efficacy of treatments</a> by de Vries et al. (<a href="https://www.nytimes.com/2018/09/24/upshot/publication-bias-threat-to-science.html?em_pos=small&emc=edit_up_20180924&nl=upshot&nl_art=0&nlid=57978065emc%3Dedit_up_20180924&ref=headline&te=1">popular coverage</a>)</li>
<li>Pre-registration portals from the <a href="https://osf.io/registries/">Open Science Framework</a>, <a href="https://cos.io/prereg/">Center for Open Science</a>, and <a href="https://aspredicted.org/index.php">AsPredicted.org</a></li>
<li>Science magazine’s announcement of <a href="http://www.sciencemag.org/careers/2015/12/register-your-study-new-publication-option">registered reports</a></li>
<li><a href="https://bost.ocks.org/mike/make/">Why Use Make</a> by Mike Bostock</li>
<li><a href="http://zmjones.com/make/">GNU Make for Reproducible Data Analysis</a></li>
<li><a href="https://www.rstudio.com/wp-content/uploads/2015/02/rmarkdown-cheatsheet.pdf">RMarkdown cheatsheet</a></li>
<li><a href="https://rmarkdown.rstudio.com/">RStudio’s RMarkdown site</a></li>
<li>The book <a href="https://bookdown.org/yihui/rmarkdown/">R Markdown: The Definitive Guide</a></li>
</ul>
Fri, 01 Mar 2019 10:00:00 +0000Homework 2
http://modelingsocialdata.org/homework/2019/02/28/homework-2.html
http://modelingsocialdata.org/homework/2019/02/28/homework-2.html<p>The second homework assignment, <a href="https://github.com/jhofman/msd2019/tree/master/homework/homework_2">posted on Github</a>, is due on Thursday, March 14 by 11:59pm ET.</p>
<p>The first problem looks at the link between coffee and cancer, the second problem examines an experiment on whether yawning is contagious, and the third problem involves replicating the results of a paper about the Google ngram dataset. Details are in the README.md file for each problem.</p>
<p>Your code and a brief report with your results are to be submitted electronically in one zipped (or tarball-ed) file through the <a href="https://courseworks2.columbia.edu/courses/77738">CourseWorks</a> site. All solutions should be contained in the corresponding files provided here. Code should be written in bash / R and should not have complex dependencies on non-standard libraries. Each problem contains an Rmarkdown file which should be rendered as a pdf to be submitted with your solution. All work should be your own and done individually.</p>
Thu, 28 Feb 2019 17:00:00 +0000Lecture 5: Reproducibility and replication, Part 1
http://modelingsocialdata.org/lectures/2019/02/22/lecture-5-reproducibility-1.html
http://modelingsocialdata.org/lectures/2019/02/22/lecture-5-reproducibility-1.html<p>We discussed the ongoing <a href="https://en.wikipedia.org/wiki/Replication_crisis">replication crisis</a> in the sciences, wherein it has proven difficult or impossible for researchers to independently verify results of previously published studies.</p>
<script async="" class="speakerdeck-embed" data-id="8c1dd50c57e14f26b3a9c8fbc9837376" data-ratio="1.33333333333333" src="//speakerdeck.com/assets/embed.js"></script>
<p>We started off the lecture by talking about how to evaluate research findings. Namely, how can you assess whether the results of a study are believable and/or important?</p>
<p>We took the optimistic view that most researchers are honest, although there are <a href="https://en.wikipedia.org/wiki/List_of_scientific_misconduct_incidents">some exceptions</a>. For instance, a recent study by <a href="http://science.sciencemag.org/content/346/6215/1366.full">LaCour and Green</a> reported that a single conversation with canvassers had lasting impact on support for gay marriage. But soon after the study was published, Broockman, Kalla, and Aronow found <a href="http://stanford.edu/~dbroock/broockman_kalla_aronow_lg_irregularities.pdf">some irregularities</a> in the data. The paper was later <a href="http://www.sciencemag.org/news/2015/05/science-retracts-gay-marriage-paper-without-agreement-lead-author-lacour">retracted</a> on account of the data being fabricated using the results of a previous study. Broockman and Kalla then proceeded to carry out <a href="http://science.sciencemag.org/content/352/6282/220">their own version</a> of such a study, and ironically found <a href="https://www.wired.com/2016/04/political-sciences-whistleblowers-rebunk-gay-canvassing-study/">support for the original hypothesis</a>.</p>
<p>While such instances of fraud are rare, there are other, more common concerns among published studies. The first is <em>reproducibility</em>, or whether one can independently verify the results of a study with the same data and same code used in the original paper. Though a low bar, most research currently doesn’t pass this test simply because it’s often the case that papers are published without all of the supporting data or code. And when the data and code are available, the code can be surprisingly difficult to understand or run, especially when there are complex software dependencies. This is improving as researchers adopt better software engineering practices and develop <a href="http://science.sciencemag.org/content/354/6317/1240.full">guidelines</a>, <a href="http://www.rctatman.com/files/2018-7-14-MLReproducability.pdf">best practices</a>, and <a href="https://medium.com/@michel.steuwer/artifact-review-and-badging-855dc11b64a0">incentives</a> for reproducibility.</p>
<p>Next we discussed <em>replicability</em>, or whether a result holds when a study is repeated with new data but the same analysis as the original paper. The main issue here is that it’s easy to be fooled by randomness: noise can dominate signal in small datasets, and asking too many questions of the data can lead to overfitting, even with large datasets. We looked at a seminal paper from the <a href="https://osf.io/vmrgu/">Open Science Collaboration</a>, <a href="http://science.sciencemag.org/content/349/6251/aac4716.full">Estimating the reproducibility of psychological science</a>, which conducts replications of 100 published psychology studies and finds that roughly a third replicate, often with smaller effect sizes than reported in the original studies.</p>
<p>This led us to a review of frequentist statistics, which, although somewhat of a <a href="https://www.mpib-berlin.mpg.de/pubdata/gigerenzer/Gigerenzer_2018_Statistical_rituals.pdf">statistical ritual</a>, is still <a href="https://amstat.tandfonline.com/doi/abs/10.1080/00031305.2016.1154108#.XE8wl89KjRY">important to understand</a>, for better or worse. A short quiz on the topic highlighted that it’s easy for newcomers and trained professionals alike to <a href="https://link.springer.com/article/10.1007%2Fs10654-016-0149-3">misunderstand</a> the meaning of p-values, hypothesis tests, and statistical significance. We reviewed null hypothesis testing through the lens of simulation, in contrast to the usual textbook approach of learning a battery of parametric tests.</p>
<p>At a high level, null hypothesis testing asks “how (un)likely are the data I observed under a certain (null) model of the world?” If the data are sufficiently unlikely, we can reject this null model; otherwise our test is inconclusive. The catch is that we have to quantify what constitutes “sufficiently unlikely” and we have to make sure our experiment is actually powerful enough to reject the null when it’s false. In the Neyman-Pearson framework, we make choices based on the long-run error rates we’re willing to tolerate if this procedure is repeated over and over again. While this is usually taught using a reasonable amount of fancy math, we instead discussed it using brute-force simulation, which allowed us to focus on the concepts instead of formulas and recipes. The basic idea is simple: if we’d like to know what to expect if the null model is actually true, we can just simulate many such experiments assuming it’s true, look at the distribution of outcomes, and compare what we actually see in the world to the results of our simulations. More details are in this notebook on
<a href="http://htmlpreview.github.io/?https://github.com/jhofman/msd2019/blob/master/lectures/lecture_5/statistical_inference.html">simulation-based statistical inference</a> and the <a href="https://github.com/jhofman/msd2019-notes/tree/master/lecture_5">scribed notes</a>.</p>
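To make the recipe concrete, here is a toy version in R (a made-up example in the spirit of the notebook, not its actual code): suppose we observe 60 heads in 100 coin flips and ask how surprising that would be under the null model of a fair coin.

```r
set.seed(42)
num_flips      <- 100
observed_heads <- 60

# simulate 100,000 experiments assuming the null (fair coin) is true
null_heads <- rbinom(1e5, size = num_flips, prob = 0.5)

# one-sided p-value: how often does the null look at least as extreme as the data?
p_value <- mean(null_heads >= observed_heads)
p_value  # ~0.03, close to the exact answer from binom.test(60, 100, alternative = "greater")
```

No parametric formula is needed: the p-value is just a count over simulated worlds where the null holds.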
<p>We’ll continue this discussion of statistics, reproducibility, replication, and evaluating research next week.</p>
<p>References:</p>
<ul>
<li><a href="http://science.sciencemag.org/content/354/6317/1240.full">Enhancing reproducibility for computational methods</a> by Stodden et al.</li>
<li><a href="http://www.rctatman.com/files/2018-7-14-MLReproducability.pdf">A Practical Taxonomy of Reproducibility for Machine Learning Research</a> by Tatman, VanderPlas & Dane</li>
<li>A post on <a href="https://medium.com/@michel.steuwer/artifact-review-and-badging-855dc11b64a0">ACM’s Artifact Review and Badging</a></li>
<li><a href="http://science.sciencemag.org/content/349/6251/aac4716.full">Estimating the reproducibility of psychological science</a> from the Open Science Collaboration</li>
<li><a href="https://www.mpib-berlin.mpg.de/pubdata/gigerenzer/Gigerenzer_2018_Statistical_rituals.pdf">Statistical Rituals: The Replication Delusion and How We Got There</a> by Gigerenzer</li>
<li>The American Statistical Association’s <a href="https://amstat.tandfonline.com/doi/abs/10.1080/00031305.2016.1154108#.XE8wl89KjRY">statement on p-values</a> by Wasserstein & Lazar</li>
<li><a href="https://link.springer.com/article/10.1007%2Fs10654-016-0149-3">Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations</a> by Greenland et al.</li>
<li><a href="https://seeing-theory.brown.edu">Seeing Theory</a>, a visual, simulation-based tour of statistics</li>
<li>Chapters 12 and 13 of <a href="http://pluto.huji.ac.il/%7Emsby/StatThink/index.html">Introduction to Statistical Thinking (With R, Without Calculus)</a></li>
<li><a href="https://www.openintro.org/stat/textbook.php">Introductory Statistics with Randomization and Simulation</a></li>
<li>Statistics for Hackers by VanderPlas (<a href="https://speakerdeck.com/jakevdp/statistics-for-hackers">slides</a>, <a href="https://www.youtube.com/watch?v=Iq9DzN6mvYA">video</a>)</li>
</ul>
Fri, 22 Feb 2019 10:10:00 +0000Lecture 4: Data Visualization
http://modelingsocialdata.org/lectures/2019/02/15/lecture-4-data-visualization.html
http://modelingsocialdata.org/lectures/2019/02/15/lecture-4-data-visualization.html<p>We used this lecture to discuss <a href="https://github.com/jhofman/msd2019/blob/master/lectures/lecture_3/intro_to_r.ipynb">data manipulation</a> and <a href="https://github.com/jhofman/msd2019/blob/master/lectures/lecture_4/visualization_with_ggplot2.ipynb">data visualization </a> in R, specifically focusing on <a href="https://dplyr.tidyverse.org"><code class="highlighter-rouge">dplyr</code></a> and <a href="https://ggplot2.tidyverse.org"><code class="highlighter-rouge">ggplot2</code></a> from the <a href="http://tidyverse.org"><code class="highlighter-rouge">tidyverse</code></a>.</p>
<script async="" class="speakerdeck-embed" data-id="4540923077774710a34ba80dfc9c4dd5" data-ratio="1.33333333333333" src="//speakerdeck.com/assets/embed.js"></script>
<p>The <code class="highlighter-rouge">tidyverse</code> relies on data being in a “tidy” format of one observation per row, one variable per column, and one value per cell. It provides tools for getting untidy data (of which there’s lots) into a tidy format. Once data are in this format, it provides tools for chaining together a string of commands, similar to Unix pipes, which makes it very easy to translate the ideas and questions in your mind into working and readable code. This allows you to spend more time exploring and understanding your data and less time debugging code.</p>
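As a minimal sketch (using R's built-in mtcars data rather than one of the course datasets), a typical dplyr chain reads almost like the question it answers:

```r
library(dplyr)

# for each number of cylinders, how many cars are there and what is
# their average fuel efficiency, from most to least efficient?
mtcars %>%
  group_by(cyl) %>%
  summarize(num_cars = n(), avg_mpg = mean(mpg)) %>%
  arrange(desc(avg_mpg))
```

Each verb does one thing, and the pipe passes its result along, so the code can be read top to bottom like a sentence.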
<script async="" class="speakerdeck-embed" data-id="5bf041357fc24ff5b9cef83713baed0e" data-ratio="1.33333333333333" src="//speakerdeck.com/assets/embed.js"></script>
<p>We discussed visualization as a way to better understand data and as a way of communicating with readers. We briefly reviewed experiments by <a href="http://www.jstor.org/stable/2288400?seq=1#page_scan_tab_contents">Cleveland and McGill</a> showing that not all visual encodings are created equal, <a href="http://dl.acm.org/citation.cfm?id=22950">Mackinlay’s</a> expressiveness / effectiveness tradeoff, and <a href="https://en.wikipedia.org/wiki/Leland_Wilkinson">Wilkinson’s</a> grammar of graphics. We spent a good amount of time discussing how every visualization should convey a point, preferably one that can be summarized by a short sentence. These data visualization slides are generously adapted from <a href="http://hci.stanford.edu/~cagatay/">Çağatay Demiralp</a>.</p>
<p>Source code for the examples we reviewed is available on the course Github page: <a href="https://github.com/jhofman/msd2019/tree/master/lectures/lecture_3">data manipulation</a>, <a href="https://github.com/jhofman/msd2019/tree/master/lectures/lecture_4">data visualization</a>.</p>
<p>There are <a href="https://pinboard.in/u:jhofman/t:r/t:tutorials/">lots of R resources</a> available on the web, but here are a few highlights:</p>
<ul>
<li><a href="http://tryr.codeschool.com">CodeSchool</a> and <a href="https://www.datacamp.com/courses/free-introduction-to-r">DataCamp</a> intro to R courses</li>
<li>More about <a href="http://www.r-tutor.com/r-introduction/basic-data-types">basic types</a> (numeric, character, logical, factor) in R</li>
<li>Vectors, lists, dataframes: a <a href="http://www.statmethods.net/input/datatypes.html">one page reference</a></li>
<li>Chapters <a href="http://r4ds.had.co.nz/introduction.html">1</a>, <a href="http://r4ds.had.co.nz/explore-intro.html">2</a>, and <a href="http://r4ds.had.co.nz/transform.html">5</a> of <a href="http://r4ds.had.co.nz">R for Data Science</a></li>
<li>DataCamp’s <a href="https://campus.datacamp.com/courses/dplyr-data-manipulation-r-tutorial">Data Manipulation in R</a> tutorial</li>
<li>The <a href="https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html">dplyr vignette</a></li>
<li>Sean Anderson’s <a href="http://seananderson.ca/2014/09/13/dplyr-intro.html">dplyr and pipes examples</a> (<a href="https://github.com/seananderson/dplyr-intro-2014">code</a> on github)</li>
<li>Rstudio’s <a href="http://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf">data wrangling cheatsheet</a></li>
<li>Hadley Wickham’s <a href="http://bit.ly/splitapplycombine">split/apply/combine</a> paper</li>
<li>The <a href="https://style.tidyverse.org">tidyverse style guide</a></li>
<li>Chapters <a href="http://r4ds.had.co.nz/data-visualisation.html">3</a>, <a href="http://r4ds.had.co.nz/exploratory-data-analysis.html">7</a>, and <a href="http://r4ds.had.co.nz/graphics-for-communication.html">28</a> in <a href="http://r4ds.had.co.nz/">R for Data Science</a></li>
<li>DataCamp’s <a href="https://campus.datacamp.com/courses/data-visualization-with-ggplot2-1/">Data Visualization with ggplot2</a> tutorial</li>
<li>Videos on <a href="http://varianceexplained.org/RData/lessons/lesson2/">Visualizing Data with ggplot2</a></li>
<li>Sean Anderson’s <a href="http://seananderson.ca/courses/12-ggplot2/ggplot2_slides_with_examples.pdf">ggplot2 slides</a> (<a href="http://github.com/seananderson/datawranglR">code</a>) for more examples</li>
<li>RStudio’s <a href="https://www.rstudio.com/resources/cheatsheets/">cheatsheets</a></li>
</ul>
Fri, 15 Feb 2019 10:10:00 +0000Lecture 3: Computational complexity
http://modelingsocialdata.org/lectures/2019/02/08/lecture-3-computational-complexity.html
http://modelingsocialdata.org/lectures/2019/02/08/lecture-3-computational-complexity.html<p>We had a guest lecture from <a href="http://sidsen.org/">Sid Sen</a> on computational complexity and algorithm analysis.</p>
<p><img src="http://modelingsocialdata.org/img/runtime_table.png" alt="Algorithm runtime in seconds, from Kleinberg & Tardos" /></p>
<p>Sid discussed various ways of analyzing how long algorithms take to run, focusing on worst-case analysis.
We discussed <a href="https://www.khanacademy.org/computing/computer-science/algorithms/asymptotic-notation/a/asymptotic-notation">asymptotic notation</a> (<a href="https://www.khanacademy.org/computing/computer-science/algorithms/asymptotic-notation/a/big-o-notation">big-O</a> for upper bounds, <a href="https://www.khanacademy.org/computing/computer-science/algorithms/asymptotic-notation/a/big-big-omega-notation">big-omega</a> for lower bounds, and <a href="https://www.khanacademy.org/computing/computer-science/algorithms/asymptotic-notation/a/big-big-theta-notation">big-theta</a> for tight bounds).
The table above, from <a href="https://www.pearsonhighered.com/program/Kleinberg-Algorithm-Design/PGM319216.html">Algorithm Design</a> by Kleinberg and Tardos, shows how long we should expect different algorithms to run on modern hardware.
The key takeaway is that knowing how to match the right algorithm to your dataset is important.
For instance, when you’re dealing with millions of observations, only linear (or maybe <a href="https://en.wikipedia.org/wiki/Time_complexity#Linearithmic_time">linearithmic</a>) time algorithms are practical.</p>
<p>A few other references:</p>
<ul>
<li>A <a href="https://rob-bell.net/2009/06/a-beginners-guide-to-big-o-notation/">beginner’s guide</a> to big-O notation</li>
<li>Another <a href="https://www.interviewcake.com/article/python/big-o-notation-time-and-space-complexity">introduction to big-O</a></li>
<li>The <a href="http://bigocheatsheet.com/">big-O cheatsheet</a></li>
</ul>
<p>We touched upon a few more advanced topics around the tradeoff between how long something takes to run and how much space it requires. Sid gave a brief overview of <a href="https://brilliant.org/wiki/skip-lists/">skip lists</a> and mentioned some more recent work by his advisor, Robert Tarjan, on <a href="https://arxiv.org/abs/1806.06726v2">zip trees</a> (video lecture <a href="https://www.youtube.com/watch?v=NxRXhBur6Xs">here</a>).</p>
<p>Sid finished his lecture by discussing how this applies to something as simple as taking the intersection of two lists, useful for <a href="https://en.wikipedia.org/wiki/Join_(SQL)">joining</a> different tables.
A naive approach of comparing all pairs of elements takes quadratic time.
It’s relatively easy to do much better by <a href="https://en.wikipedia.org/wiki/Sort-merge_join">sorting and merging</a> the two sets, reducing this to <code class="highlighter-rouge">n log(n)</code> time.
And if we’re willing to trade space for time, we can use a <a href="https://en.wikipedia.org/wiki/Hash_table">hash table</a> to get the job done in linear time, known as a <a href="https://en.wikipedia.org/wiki/Hash_join">hash join</a>.</p>
<p>We used the end of lecture to revisit the command line and finish up a few leftover topics. See <a href="/lectures/2019/02/01/lecture-2-counting.html">last week’s post</a> for links to code from class.</p>
<p>Next week we’ll discuss data manipulation in R. In preparation, make sure to <a href="/homework/2019/01/24/installing-tools.html">set up</a> R and the <a href="https://www.tidyverse.org">tidyverse</a> packages. If you’re new to R, in addition to the readings in the R4DS book, check out the <a href="http://tryr.codeschool.com">CodeSchool</a> and <a href="https://www.datacamp.com/courses/free-introduction-to-r">DataCamp</a> intro to R courses. Also have a look at the slides and code we’ll discuss in class next week, which are <a href="https://github.com/jhofman/msd2019/tree/master/lectures/lecture_3">up on github</a>.</p>
Fri, 08 Feb 2019 00:00:00 +0000Homework 1
http://modelingsocialdata.org/homework/2019/02/07/homework-1.html
http://modelingsocialdata.org/homework/2019/02/07/homework-1.html<p>The first homework assignment, <a href="https://github.com/jhofman/msd2019/tree/master/homework/homework_1">posted on Github</a>, is due on Thursday, February 21 by 11:59pm ET.</p>
<p>The first problem explores various counting techniques, the second involves some command line and R counting exercises, and the third looks at the impact of inventory size on customer satisfaction for the MovieLens data.
Details are in the README.md file for each problem.</p>
<p>Your code and a brief report with your results are to be submitted electronically in one zipped (or tarball-ed) file through the <a href="https://courseworks2.columbia.edu/courses/77738">CourseWorks</a> site.
All code should be contained in plain text files and should produce the exact results you provide in your writeup.
Code should be written in bash / R and should not have complex dependencies on non-standard libraries.
The report should simply present your answers to the questions in an organized format as either a plain text or pdf file.
All work should be your own and done individually.</p>
Thu, 07 Feb 2019 17:00:00 +0000Lecture 2: Introduction to Counting
http://modelingsocialdata.org/lectures/2019/02/01/lecture-2-counting.html
http://modelingsocialdata.org/lectures/2019/02/01/lecture-2-counting.html<p>Counting is surprisingly useful for understanding and summarizing social data. The key is figuring out what to count and how to count it efficiently.</p>
<center>
<script async="" class="speakerdeck-embed" data-id="0c088c1b50e44966a74c52e0b331995e" data-ratio="1.33333333333333" src="//speakerdeck.com/assets/embed.js"></script>
</center>
<p>While it’s a seemingly simple concept, counting can be quite challenging in practice, especially when dealing with large, multi-dimensional data.</p>
<p>First we discussed simple counting and uncertainty in the context of polling. We used <a href="https://github.com/jhofman/msd2019/blob/master/lectures/lecture_2/flip_coins.ipynb">simulations</a> to determine how large of a poll to conduct to stay within a given <a href="https://en.wikipedia.org/wiki/Margin_of_error#Calculations_assuming_random_sampling">margin of error</a>. In practice, there are many <a href="https://www.nytimes.com/interactive/2016/09/20/upshot/the-error-the-polling-world-rarely-talks-about.html">sources of uncertainty</a> in polling, which can often lead to much <a href="https://www.nytimes.com/2016/10/06/upshot/when-you-hear-the-margin-of-error-is-plus-or-minus-3-percent-think-7-instead.html">larger margins of error</a> than these results imply. See Chapters 5 and 6 of <a href="http://pluto.huji.ac.il/~msby/StatThink/index.html">Intro to Statistical Thinking (With R, Without Calculus)</a> for background on binomial random variables and sampling distributions.</p>
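The flavor of those simulations can be reproduced in a few lines of R (with illustrative numbers, not the class notebook): run many hypothetical polls of n people when true support is 50% and look at how much the estimates vary.

```r
set.seed(1)
n         <- 1000                                   # people per poll
estimates <- rbinom(1e4, size = n, prob = 0.5) / n  # 10,000 simulated polls

# 95% of simulated polls land within roughly +/- 3 points of the truth,
# matching the textbook 1.96 * sqrt(p * (1 - p) / n) margin of error
quantile(estimates, c(0.025, 0.975))
```

Quadrupling the sample size only halves the margin of error, which is why precise polls get expensive quickly.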
<p>Then we discussed the <a href="http://bit.ly/splitapplycombine">split/apply/combine</a> paradigm for counting and applied it to several examples from <a href="http://5harad.com/papers/long_tail.pdf">The Anatomy of the Long Tail</a>.
We also looked at alternative models for counting that trade off flexibility for scalability, such as <a href="http://en.wikipedia.org/wiki/Streaming_algorithm">streaming algorithms</a>.
Streaming allows us to compute statistics such as the mean or <a href="http://www.johndcook.com/blog/standard_deviation/">variance</a> without having to read all of the data into memory first.
We summarized these approaches and compared the types of statistics that can be computed under various conditions.</p>
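The streaming approach described in the linked post is Welford's algorithm; a sketch in R (an illustration, not the course code) makes one pass over the data and keeps only three running quantities:

```r
# one pass, O(1) memory: maintain the count, the running mean, and a
# running sum of squared deviations (Welford's algorithm)
streaming_stats <- function(xs) {
  n <- 0; avg <- 0; m2 <- 0
  for (x in xs) {
    n     <- n + 1
    delta <- x - avg
    avg   <- avg + delta / n         # update the running mean
    m2    <- m2 + delta * (x - avg)  # update sum of squared deviations
  }
  list(mean = avg, var = m2 / (n - 1))  # sample variance
}

streaming_stats(c(2, 4, 4, 4, 5, 5, 7, 9))  # agrees with mean() and var()
```

Because each observation is processed once and then discarded, the same code works whether the data fit in memory or arrive as an endless stream.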
<p>We concluded with an <a href="https://github.com/jhofman/msd2019/blob/master/lectures/lecture_2/intro_command_line.ipynb">introduction to the command line</a>, including some simple counting and exploration of the <a href="https://www.citibikenyc.com/system-data">CitiBike trip data</a>. Additional command line references and tutorials can be found in the <a href="/homework/2019/01/24/installing-tools.html">installing tools</a> post. All code and slides are on the <a href="https://github.com/jhofman/msd2019">course Github page</a>.</p>
Fri, 01 Feb 2019 00:00:00 +0000Lecture 1: Overview
http://modelingsocialdata.org/lectures/2019/01/25/lecture-1-overview.html
http://modelingsocialdata.org/lectures/2019/01/25/lecture-1-overview.html<p>We used our first lecture to look at case studies in four main areas: exploratory data analysis, classification, regression, and working with network data.</p>
<script async="" class="speakerdeck-embed" data-id="e74aaadf1779487c901c6bf3a2701902" data-ratio="1.33333333333333" src="//speakerdeck.com/assets/embed.js"></script>
<p>We discussed a few examples, including using aggregate search activity to <a href="http://www.pnas.org/content/107/41/17486.full.pdf">predict consumer behavior</a>, exploring browsing logs to understand how Internet usage <a href="http://www.aaai.org/ocs/index.php/ICWSM/ICWSM12/paper/viewFile/4660/4975">varies across demographic groups</a>, and analyzing the structure of information cascades to understand <a href="https://5harad.com/papers/twiral.pdf">how content spreads online</a>.</p>
<p>During this discussion, we touched on how easy it is to find <a href="http://www.tylervigen.com">spurious correlations</a>, <a href="http://hunch.net/?p=22">cheat at prediction</a>, and be <a href="https://en.wikipedia.org/wiki/Data_dredging">fooled by randomness</a>.</p>
Fri, 25 Jan 2019 00:00:00 +0000Installing tools
http://modelingsocialdata.org/homework/2019/01/24/installing-tools.html
http://modelingsocialdata.org/homework/2019/01/24/installing-tools.html<p>This class will involve a good deal of coding, for which you will need some basic tools. Please make sure to set up the following tools after the first day of class.</p>
<h3 id="an-interactive-bash-shell">An interactive <a href="http://www.gnu.org/software/bash/">bash</a> shell</h3>
<p>This will give you the ability to interact with your filesystem via the command line instead of a GUI such as Windows Explorer or Mac Finder. We will also use bash to automate acquiring and cleaning data sets.</p>
<p>If you use Windows, you can try the <a href="http://www.howtogeek.com/249966/how-to-install-and-use-the-linux-bash-shell-on-windows-10/">built-in bash/Ubuntu</a> shell on Windows 10, or you can <a href="https://cygwin.com/install.html">install Cygwin</a>, which includes bash and a terminal application by default. Mac OS X includes a bash shell by default, and a terminal application in <code class="highlighter-rouge">/Applications/Utilities</code>. Linux also includes a working shell and terminal.</p>
<p>Verify that your environment is properly configured by typing the following commands indicated after the <code class="highlighter-rouge">#</code> symbol. You should see something similar (although not necessarily identical) to the following:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code># echo $SHELL
/bin/bash
# grep --version
grep (BSD grep) 2.5.1-FreeBSD
# cut
usage: cut -b list [-n] [file ...]
cut -c list [file ...]
cut -f list [-s] [-d delim] [file ...]
</code></pre></div></div>
<p>If you’re new to the command line, see Codecademy’s <a href="https://www.codecademy.com/courses/learn-the-command-line/lessons/navigation/exercises/your-first-command?action=lesson_resume">interactive tutorial</a>, this <a href="https://learnpythonthehardway.org/book/appendixa.html">crash course</a>, and Software Carpentry’s <a href="http://swcarpentry.github.io/shell-novice/">guide</a>.
Lifehacker’s <a href="http://lifehacker.com/5633909/who-needs-a-mouse-learn-to-use-the-command-line-for-almost-anything">command line primer</a> is also decent.</p>
<p>O’Reilly’s <a href="http://shop.oreilly.com/product/9780596005955.do">Classic Shell Scripting</a> book is a more complete reference.</p>
<!-- https://github.com/veltman/clmystery -->
<h3 id="a-git-client">A <a href="http://git-scm.com">Git</a> client</h3>
<p>Git is a version control system that allows you to track modifications to files and code over time. It also facilitates collaborations so that multiple people can share and edit the same code base.</p>
<p>If you are on Windows you can install <a href="https://windows.github.com">Github for Windows</a> which provides both the command line tool for git and a graphical user interface. Alternatively, you can install git as an optional package under Cygwin. We recommend the Github application, as it will be easier to interface with Github using it. Likewise, modern versions of Mac OS X have a command line git client installed by default, but the <a href="https://mac.github.com">Github for Mac</a> tool is a recommended addition. Linux users can install git with the appropriate package manager (e.g., <code class="highlighter-rouge">yum install git</code> on RedHat or <code class="highlighter-rouge">apt-get install git</code> on Debian/Ubuntu), and there are a number of different <a href="http://unix.stackexchange.com/questions/144100/is-there-a-usable-gui-front-end-to-git-on-linux">git GUIs for Linux</a>.</p>
<p>Complete this relatively brief <a href="https://www.codeschool.com/courses/try-git">interactive tour of git</a>. See this <a href="http://rogerdudler.github.io/git-guide/">one-page guide</a> for explanations of the usual git workflow and the most common commands, or <a href="http://kbroman.org/github_tutorial/">here</a> for a more detailed guide. Github also has an <a href="https://www.youtube.com/watch?v=U8GBXvdmHT4">introductory video</a>, some <a href="https://services.github.com/resources/">training courses</a>, and a handy <a href="https://services.github.com/on-demand/resources/cheatsheets/">cheatsheet</a>.</p>
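<p>The usual workflow boils down to editing files, staging them, committing, and (when working with a remote) pushing. A minimal sketch, run in a throwaway repository with a made-up file name:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code># practice in a scratch repository
cd "$(mktemp -d)"
git init -q .
# create a file, stage it, and record a commit
touch notes.txt
git add notes.txt
git -c user.name=student -c user.email=student@example.com commit -q -m "Add notes"
# inspect the history
git log --oneline
</code></pre></div></div>
<p>In a cloned repository you would follow the commit with <code class="highlighter-rouge">git push</code> to publish your changes.</p>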
<h3 id="a-github-account">A <a href="http://github.com">Github</a> account</h3>
<p>Github is a platform that facilitates collaboration on projects that use git. You can use it to host projects, publish them to the web, and share them with other people. <a href="https://help.github.com/articles/signing-up-for-a-new-github-account/">Create a free account</a> if you don’t already have one.</p>
<p>Once you have an account, clone the <a href="https://github.com/jhofman/msd2019">course repository</a> using your local git client. This is most easily done on the command line as follows:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ git clone https://github.com/jhofman/msd2019.git
Cloning into 'msd2019'...
remote: Enumerating objects: 6, done.
remote: Counting objects: 100% (6/6), done.
remote: Compressing objects: 100% (4/4), done.
remote: Total 6 (delta 1), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (6/6), done.
</code></pre></div></div>
<p>When this is complete, verify that you have a local directory called <code class="highlighter-rouge">msd2019</code> containing a <code class="highlighter-rouge">README.md</code> file.</p>
<!-- https://happygitwithr.com -->
<h3 id="r-and-rstudio">R and RStudio</h3>
<p>R is a useful programming language for exploratory data analysis, data visualization, and statistical modeling. RStudio is a popular integrated development environment (IDE) for working in R.</p>
<p>First, download and install R from a <a href="https://cloud.r-project.org/">CRAN mirror</a>. Then download RStudio from <a href="https://www.rstudio.com/products/rstudio/download/">here</a>. Finally, install and load some important packages as follows:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>install.packages('tidyverse')
library(tidyverse)
</code></pre></div></div>
<p>If you’re new to R, see the <a href="http://tryr.codeschool.com/">Code School</a> and <a href="http://datacamp.com/courses/free-introduction-to-r">DataCamp</a> online tutorials.</p>
<p>We will discuss all of these tools in more detail in class.</p>
Thu, 24 Jan 2019 00:00:00 +0000