http://modelingsocialdata.org/
Tue, 25 Apr 2017 01:29:42 +0000
Homework 3
http://modelingsocialdata.org/homework/2017/04/13/homework-3.html
<p>The third homework assignment, <a href="https://github.com/jhofman/msd2017/tree/master/homework/homework_3">posted on Github</a>, is due on Monday, April 24 by 11:59pm ET.</p>
<p>The first problem looks at logistic regression for text classification, the second explores the small-world phenomenon in “close” vs. “distant” friend networks, and the third studies how the structure of an email network changes as we remove weak ties from it.</p>
<p>Your code and a brief report with your results are to be submitted electronically in one zipped (or tarball-ed) file through the <a href="https://courseworks2.columbia.edu/courses/sis_course_id:APMAE4990_001_2017_1">CourseWorks</a> site.
All code should be contained in plain text files and should produce the exact results you provide in your writeup.
Code should be written in bash / R and should not have complex dependencies on non-standard libraries.
The report should simply present your answers to the questions in an organized format as either a plain text or pdf file.
All work should be your own and done individually.</p>
Thu, 13 Apr 2017 00:00:00 +0000
Lectures 10 & 11: Networks
http://modelingsocialdata.org/lectures/2017/04/07/lectures-10-11-networks.html
<p>We spent these two lectures discussing network data, including a whirlwind tour of the history of network theory, representations and characteristics of networks, and algorithms for analyzing network data.</p>
<center>
<iframe src="//www.slideshare.net/slideshow/embed_code/key/4wtOi0tDYzVPPs" width="476" height="400" frameborder="0" marginwidth="0" marginheight="0" scrolling="no"></iframe>
</center>
<p>People have studied <em>theoretical</em> problems on and properties of graphs for a long time, but only in the last few decades have we had access to <em>real network data</em>, such as online social networks or the topology of the Internet.
When these data became available, it quickly became clear that real networks looked quite different from well-studied theoretical models (e.g., <a href="http://en.wikipedia.org/wiki/Erdős–Rényi_model">Erdős–Rényi</a> random graphs).
For example, many real networks have highly <a href="http://en.wikipedia.org/wiki/Complex_network#Scale-free_networks">skewed degree distributions</a>, reflecting the fact that most people in a social network have few friends while only a few people have many friends.
At the same time, social networks typically have <a href="http://en.wikipedia.org/wiki/Small-world_network">short path lengths</a>, in the sense that one needs to traverse only a handful of links to connect a randomly selected pair of people in the network.</p>
<p>After discussing many different types of networks that we might analyze as well as the various levels of abstraction available for representing them, we turned to algorithms for efficiently computing shortest path lengths, connected components, mutual friends, and clustering coefficients.</p>
<p>We started with the problem of finding the shortest distance between a single source node and all other nodes in an (undirected, unweighted) network, as measured by the fewest number of edges you need to traverse to get from the source to every other node.
(Every researcher’s favorite version of this is computing their <a href="http://en.wikipedia.org/wiki/Erdős_number">Erdős number</a>, the academic take on the more well-known <a href="http://en.wikipedia.org/wiki/Six_Degrees_of_Kevin_Bacon">Kevin Bacon game</a>. Compute yours <a href="http://academic.research.microsoft.com">here</a>.)</p>
<p>Breadth first search (BFS) provides a nice solution.
The intuition behind BFS is simple: we start from the source node and mark it as distance zero from itself.
Then we visit each of its neighbors and mark those as distance one.
We repeat this iteratively, pushing forward a boundary of recently discovered nodes that are one additional hop from the source at each step.
BFS visits each node once and examines each edge a constant number of times, so its running time scales linearly in the size of the network.
If, however, we would like to find the shortest distance between <em>all pairs</em> of nodes then we must repeat this for each possible source node, and so this quickly becomes prohibitively expensive for even moderately sized networks.
(See <a href="http://en.wikipedia.org/wiki/Shortest_path_problem#All-pairs_shortest_paths">here</a> for fancier, more efficient algorithms.)</p>
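<p>The BFS procedure described above can be sketched in a few lines. This is a minimal illustration in Python rather than the course's R code; the function name <code>bfs_distances</code>, the adjacency-list representation, and the toy network are assumptions made for the example.</p>

```python
from collections import deque

def bfs_distances(adjacency, source):
    """Return a dict mapping each node reachable from source to its hop count."""
    distances = {source: 0}          # the source is distance zero from itself
    frontier = deque([source])       # boundary of recently discovered nodes
    while frontier:
        node = frontier.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in distances:   # first discovery gives the shortest distance
                distances[neighbor] = distances[node] + 1
                frontier.append(neighbor)
    return distances

# Hypothetical toy network: a path a-b-c-d plus an isolated node e.
toy_graph = {
    "a": ["b"],
    "b": ["a", "c"],
    "c": ["b", "d"],
    "d": ["c"],
    "e": [],
}
print(bfs_distances(toy_graph, "a"))
```

<p>Nodes that never appear in the returned dictionary (here, <code>e</code>) are unreachable from the source.</p>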
<p>Next we looked at using BFS for a related problem: finding the number of <a href="http://en.wikipedia.org/wiki/Connected_component_(graph_theory)">connected components</a>, or separate pieces, of a network.
We did this by simply looping over our shortest path code, seeding it on each iteration with a currently unreachable node as the source until all nodes had been reached.
We gave the nodes reached in each BFS a unique label corresponding to their component.</p>
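<p>A rough sketch of this labeling loop, again in Python and under the same assumed adjacency-list representation (the name <code>connected_components</code> and the toy network are hypothetical):</p>

```python
from collections import deque

def connected_components(adjacency):
    """Label every node with the index of the component it belongs to."""
    labels = {}
    next_label = 0
    for seed in adjacency:
        if seed in labels:           # already reached by an earlier BFS
            continue
        labels[seed] = next_label    # seed a new BFS from an unreached node
        frontier = deque([seed])
        while frontier:
            node = frontier.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in labels:
                    labels[neighbor] = next_label
                    frontier.append(neighbor)
        next_label += 1
    return labels

# Hypothetical toy network with three separate pieces.
toy_graph = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4], 6: []}
labels = connected_components(toy_graph)
print(labels)
print("number of components:", len(set(labels.values())))
```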
<p>Then we moved on to computing the number of friends that any two nodes have in common, motivated by the problem of friend recommendations on social networks.
The underlying idea can be traced back to <a href="https://sociology.stanford.edu/sites/default/files/publications/the_strength_of_weak_ties_and_exch_w-gans.pdf">Granovetter</a>: two people are likely to know each other if they have many mutual friends.
To compute the number of mutual friends between all pairs of nodes, we exploit the fact that the neighbors of every node share that node as a common friend.
To count all mutual friends we simply loop over each node and increment a counter for every pair of its neighbors.
For each node this scales as the square of its degree, so the whole algorithm scales as the sum of the squared degrees of all nodes.
This can quickly become expensive if we have even a few high-degree nodes, which are quite common in practice.</p>
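<p>As an illustration of the counting scheme described above, the following Python sketch loops over each node and increments a counter for every pair of its neighbors. The function name and the toy network are assumptions for the example, not the code used in lecture.</p>

```python
from collections import Counter
from itertools import combinations

def mutual_friend_counts(adjacency):
    """Count mutual friends for every pair of nodes that shares at least one."""
    counts = Counter()
    for node, neighbors in adjacency.items():
        # Every pair of this node's neighbors has this node as a mutual friend.
        # The number of pairs grows as the square of the node's degree.
        for u, v in combinations(sorted(neighbors), 2):
            counts[(u, v)] += 1
    return counts

# Hypothetical toy network: a four-node cycle a-b-d-c-a.
toy_graph = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["a", "d"],
    "d": ["b", "c"],
}
print(mutual_friend_counts(toy_graph))
```

<p>In this toy network, <code>a</code> and <code>d</code> are not directly connected but share two mutual friends, which is the kind of pair a friend recommender might surface.</p>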
<p>Finally, we looked at the closely related problem of counting the number of triangles around each node in a network.
This algorithm is nearly identical to computing mutual friends, as we generate the same set of two-hop paths through all pairs of a node’s neighbors, but simply increment different counters to generate different results.
Instead of accumulating mutual friends for each pair of a node’s neighbors, we ask whether each pair of neighbors is itself directly connected.
If so, we count this as (half of) a triangle in which the node participates.
Dividing the number of closed triangles in a network by the number of possible triangles that could be present gives a useful measure of how <a href="http://en.wikipedia.org/wiki/Clustering_coefficient">clustered</a> a network is.</p>
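<p>A hedged Python sketch of the triangle count and the resulting clustering measure follows. It iterates over unordered pairs of neighbors, so each closed pair counts as one full triangle at that node rather than a half; the function names and toy network are assumptions for illustration.</p>

```python
from itertools import combinations

def triangles_per_node(adjacency):
    """Count, for each node, the pairs of its neighbors that are directly connected."""
    neighbor_sets = {node: set(nbrs) for node, nbrs in adjacency.items()}
    triangles = {}
    for node, nbrs in neighbor_sets.items():
        triangles[node] = sum(
            1 for u, v in combinations(nbrs, 2) if v in neighbor_sets[u]
        )
    return triangles

def global_clustering(adjacency):
    """Closed neighbor pairs divided by all neighbor pairs, summed over nodes."""
    closed = sum(triangles_per_node(adjacency).values())
    possible = sum(
        len(set(nbrs)) * (len(set(nbrs)) - 1) // 2 for nbrs in adjacency.values()
    )
    return closed / possible if possible else 0.0

# Hypothetical toy network: a triangle a-b-c with a pendant node d attached to c.
toy_graph = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["c"],
}
print(triangles_per_node(toy_graph))
print(global_clustering(toy_graph))
```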
<p>To better understand properties of networks and how to compute them, we looked at a few example networks in R using the <code class="highlighter-rouge">igraph</code> package.
See the <a href="https://github.com/jhofman/msd2017/tree/master/lectures/lecture_10">notebooks</a> on the course GitHub page for related code and data used in the lectures.</p>
<p>References:</p>
<ul>
<li>Chapters 2, 18, and 20 of Easley and Kleinberg’s <a href="http://www.cs.cornell.edu/home/kleinber/networks-book/">Networks, Crowds, and Markets</a></li>
<li>Granovetter’s <a href="https://sociology.stanford.edu/sites/default/files/publications/the_strength_of_weak_ties_and_exch_w-gans.pdf">Strength of Weak Ties</a> paper</li>
<li>de Solla Price on <a href="http://garfield.library.upenn.edu/papers/pricenetworks1965.pdf">citation networks</a> and <a href="http://garfield.library.upenn.edu/price/pricetheory1976.pdf">cumulative advantage</a></li>
<li><a href="https://www.math.cornell.edu/m/sites/default/files/imported/People/strogatz/nature_smallworld.pdf">Collective dynamics of ‘small-world’ networks</a> by Watts & Strogatz</li>
<li><a href="http://web.stanford.edu/~jugander/papers/websci12-fourdegrees.pdf">Four degrees of separation</a>: scaling up calculations to the entire Facebook social graph</li>
<li><a href="http://www.rebennack.net/SEA2011/files/talks/SEA2011_Pajor.pdf">Customizable route planning</a>: how shortest path calculations are done in modern mapping applications</li>
<li>These <a href="https://berkeleydatascience.files.wordpress.com/2012/03/20120320berkeley.pdf">slides</a> on the early system for friend recommendation on Facebook (pages 28 to 37)</li>
</ul>
Fri, 07 Apr 2017 10:00:00 +0000
Lectures 8 & 9: Classification
http://modelingsocialdata.org/lectures/2017/03/24/lectures-8-9-classification.html
<p>This post covers two lectures on classification, the first a guest lecture from <a href="http://www.columbia.edu/~chw2/">Chris Wiggins</a>.</p>
<center>
<iframe src="//www.slideshare.net/slideshow/embed_code/key/n0IHNlWKh5z0Di" width="476" height="400" frameborder="0" marginwidth="0" marginheight="0" scrolling="no"></iframe>
</center>
<p>Chris opened his guest lecture by introducing the problem of classification, where the outcome is categorical (e.g., whether an email is spam or <a href="https://wiki.apache.org/spamassassin/Ham">ham</a>) rather than continuous.
We first reviewed <a href="http://en.wikipedia.org/wiki/Bayes'_rule">Bayes’ rule</a> for inverting conditional probabilities via a simple, but <a href="http://bit.ly/ggbbc">somewhat counterintuitive</a>, <a href="http://www.scientificamerican.com/article/what-is-bayess-theorem-an/">medical diagnosis example</a> and then adapted this to an (extremely naive) <a href="https://github.com/jhofman/msd2017/blob/master/lectures/lecture_8/enron_naive_bayes.sh">one-word spam classifier</a>.
We improved upon this by considering all words present in a document and arrived at naive Bayes—a simple linear method for classification in which we model each word occurrence independently and use Bayes’ rule to calculate the probability the document belongs to each class given the words it contains.
Chris concluded with a unifying overview of various loss functions and derived <a href="https://en.wikipedia.org/wiki/Boosting_%28machine_learning%29">boosting</a> under exponential loss.</p>
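<p>To make the Bayes’ rule step concrete, here is a small Python sketch of a diagnosis-style calculation. The numbers are hypothetical and are not taken from the lecture or the linked articles; the function name is likewise an assumption.</p>

```python
def posterior_given_positive(prior, sensitivity, false_positive_rate):
    """Bayes' rule: probability of having the condition given a positive test."""
    true_positives = sensitivity * prior
    false_positives = false_positive_rate * (1.0 - prior)
    return true_positives / (true_positives + false_positives)

# Hypothetical numbers: 1% prevalence, 99% sensitivity, 5% false positive rate.
posterior = posterior_given_positive(prior=0.01, sensitivity=0.99, false_positive_rate=0.05)
print(round(posterior, 3))
```

<p>With these made-up inputs the posterior is only about one in six, even though the test looks accurate, because the condition is rare to begin with.</p>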
<p>Although naive Bayes makes an obviously incorrect assumption that all features are independent, it turns out to be a reasonably useful method in practice.
It’s simple and scalable to train, easy to update as new data arrive, easy to interpret, and often more competitive in performance than one might expect.
That said, there are some obvious issues with naive Bayes as presented, namely overfitting in the training process and overconfidence / miscalibration when making predictions.</p>
<p>The first issue arises when thinking about how to estimate word probabilities.
Simple maximum likelihood estimates (MLE) for word probabilities lead to overfitting, implying, for instance, that it’s impossible to see a word in a given class in the future if we’ve never seen it occur in that class in the past.
We dealt with this by thinking about maximum a posteriori (MAP) estimation which led to the idea of <a href="https://en.wikipedia.org/wiki/Additive_smoothing">Laplace smoothing</a>, or adding <a href="http://en.wikipedia.org/wiki/Pseudocount">pseudocounts</a> to empirical word counts to prevent overfitting.
As usual, determining the amount of smoothing to use is an empirical question, often solved by methods such as cross-validation.</p>
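<p>The following Python sketch shows one way pseudocounts might enter a naive Bayes classifier. It models word presence or absence per document (a Bernoulli variant, which is an assumption here, as are the function names, the default pseudocount of 1, and the tiny made-up corpus).</p>

```python
import math
from collections import Counter

def train_naive_bayes(documents, labels, pseudocount=1.0):
    """Estimate class priors and smoothed per-class word-presence probabilities."""
    vocabulary = {word for doc in documents for word in doc}
    class_docs = Counter(labels)
    word_docs = {c: Counter() for c in class_docs}
    for doc, label in zip(documents, labels):
        word_docs[label].update(set(doc))   # count documents containing each word
    model = {"vocabulary": vocabulary, "prior": {}, "word_prob": {}}
    for c, n_docs in class_docs.items():
        model["prior"][c] = n_docs / len(documents)
        # Pseudocounts keep every probability strictly between 0 and 1.
        model["word_prob"][c] = {
            w: (word_docs[c][w] + pseudocount) / (n_docs + 2 * pseudocount)
            for w in vocabulary
        }
    return model

def log_posterior_scores(model, doc):
    """Unnormalized log posterior of each class, treating words as independent."""
    present = set(doc)
    scores = {}
    for c, prior in model["prior"].items():
        score = math.log(prior)
        for w in model["vocabulary"]:
            p = model["word_prob"][c][w]
            score += math.log(p if w in present else 1.0 - p)
        scores[c] = score
    return scores

# Hypothetical four-document corpus.
docs = [["free", "money"], ["free", "offer"], ["meeting", "today"], ["lunch", "today"]]
labels = ["spam", "spam", "ham", "ham"]
model = train_naive_bayes(docs, labels)
scores = log_posterior_scores(model, ["free", "lunch"])
print(max(scores, key=scores.get))
```

<p>Without smoothing, “lunch” would have probability zero under the spam class and the log would be undefined; with a pseudocount it is small but positive.</p>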
<p>As for the second problem, which stems from the assumption of feature independence, we addressed this by abandoning naive Bayes in favor of logistic regression.
Logistic regression makes predictions using the same functional form as naive Bayes—the log-odds are modeled as a weighted combination of feature values—but fits these weights in a manner that accounts for correlations between features.
We (once again) applied the maximum likelihood principle to arrive at criteria for estimating these weights, and discussed gradient descent and <a href="http://en.wikipedia.org/wiki/Newton's_method">Newton’s method</a> as ways of solving for them.
The resulting algorithms are very close in spirit to those for linear regression, but slightly more complex due to the logistic function.
And, similar to linear regression, we discussed the idea of regularizing logistic regression by including a term in the loss function to penalize large weight vectors.</p>
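<p>A minimal Python sketch of regularized logistic regression fit by batch gradient descent is given below. The learning rate, penalty strength, step count, function names, and toy data are all assumptions chosen for illustration, not values from the lecture.</p>

```python
import math

def sigmoid(z):
    """Logistic function mapping log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(features, labels, learning_rate=0.5, l2_penalty=0.01, steps=2000):
    """Minimize average log loss plus (l2_penalty / 2) * ||weights||^2."""
    n, k = len(features), len(features[0])
    weights = [0.0] * k
    for _ in range(steps):
        gradient = [l2_penalty * w for w in weights]   # gradient of the penalty term
        for x, y in zip(features, labels):
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)))
            for j in range(k):
                gradient[j] += (p - y) * x[j] / n       # gradient of the log loss
        weights = [w - learning_rate * g for w, g in zip(weights, gradient)]
    return weights

# Hypothetical data: first column is a constant 1 for the intercept.
X = [[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]]
y = [0, 0, 1, 1]
weights = fit_logistic(X, y)
print(weights)
print(sigmoid(weights[0] + 2.0 * weights[1]))
```

<p>The toy data are perfectly separable, so without the penalty term the weights would keep growing; the penalty keeps them finite.</p>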
<p>We concluded with a discussion of several metrics for evaluating classifiers, including calibration, <a href="https://en.wikipedia.org/wiki/Confusion_matrix">confusion matrices</a>, accuracy, precision and recall, and the <a href="https://en.wikipedia.org/wiki/Receiver_operating_characteristic">ROC curve</a>.</p>
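<p>Several of these metrics follow directly from the confusion matrix. The Python sketch below assumes binary labels coded as 0 and 1 and at least one predicted and one actual positive (otherwise precision or recall is undefined); the function names and example labels are hypothetical.</p>

```python
def confusion_counts(actual, predicted):
    """Tally true/false positives and negatives for 0/1 labels."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            counts["tp"] += 1
        elif a == 0 and p == 1:
            counts["fp"] += 1
        elif a == 1 and p == 0:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts

def summarize(counts):
    """Accuracy, precision, and recall from confusion-matrix counts."""
    total = sum(counts.values())
    return {
        "accuracy": (counts["tp"] + counts["tn"]) / total,
        "precision": counts["tp"] / (counts["tp"] + counts["fp"]),
        "recall": counts["tp"] / (counts["tp"] + counts["fn"]),
    }

# Hypothetical labels for eight examples.
actual = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0]
counts = confusion_counts(actual, predicted)
print(counts)
print(summarize(counts))
```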
<p>A few references:</p>
<ul>
<li>Chapter 12 of <a href="http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/">Advanced Data Analysis from an Elementary Point of View</a></li>
<li>Chapter 4 of <a href="http://www-bcf.usc.edu/~gareth/ISL/getbook.html">An Introduction to Statistical Learning</a></li>
<li><a href="http://www.cs.iastate.edu/~honavar/bayes-lewis.pdf">Naive Bayes at 40</a> by Lewis (1998)</li>
<li><a href="http://www.jstor.org/pss/1403452">Idiot’s Bayes—Not So Stupid After All?</a> by Hand and Yu (2001)</li>
<li><a href="http://robotics.stanford.edu/users/sahami/papers-dir/spam.pdf">A Bayesian Approach to Filtering Junk E-mail</a> from Sahami, Dumais, Heckerman, and Horvitz (1998)</li>
<li><a href="http://www.paulgraham.com/spam.html">A Plan for Spam</a> by Paul Graham (2002)</li>
<li><a href="https://ccrma.stanford.edu/workshops/mir2009/references/ROCintro.pdf">An introduction to ROC analysis</a></li>
</ul>
Fri, 24 Mar 2017 10:00:00 +0000
Homework 2
http://modelingsocialdata.org/homework/2017/03/14/homework-2.html
<p>The second homework assignment, <a href="https://github.com/jhofman/msd2017/tree/master/homework/homework_2">posted on Github</a>, is due on Monday, March 27 by 11:59pm ET.</p>
<p>The first problem explores various modeling scenarios, the second looks at cross-validation for polynomial regression, and the third involves fitting and interpreting a model of supermarket sales data.
Details are in the README.md file for each problem.</p>
<p>Your code and a brief report with your results are to be submitted electronically in one zipped (or tarball-ed) file through the <a href="https://courseworks2.columbia.edu/courses/sis_course_id:APMAE4990_001_2017_1">CourseWorks</a> site.
All code should be contained in plain text files and should produce the exact results you provide in your writeup.
Code should be written in bash / R and should not have complex dependencies on non-standard libraries.
The report should simply present your answers to the questions in an organized format as either a plain text or pdf file.
All work should be your own and done individually.</p>
Tue, 14 Mar 2017 20:00:00 +0000
Lecture 7: Regression, Part 2
http://modelingsocialdata.org/lectures/2017/03/03/lecture-7-regression-2.html
<p>This was the second lecture on the theory and practice of regression, focused on model complexity and generalization.</p>
<center>
<iframe src="//www.slideshare.net/slideshow/embed_code/key/AO2fqTF50kBrOb" width="476" height="400" frameborder="0" marginwidth="0" marginheight="0" scrolling="no"></iframe>
</center>
<p>We started by revisiting the pageview prediction problem from last lecture.
Last time we worked on constructing a model that captured some of the trends in typical browsing activity as a function of gender and age.
We saw that including quadratic terms for age and interacting these with gender gave a reasonable model, at least in terms of visually matching empirical aggregates.
This time we talked about two high-level points: first, quantifying model fit, and second, knowing when to stop fitting.
In the setting above, this translates to asking “how good is a quadratic fit” and “why shouldn’t I use a cubic, or quartic, etc.?”</p>
<p>To the first point, we discussed <a href="https://en.wikipedia.org/wiki/Root-mean-square_deviation">root mean squared error (RMSE)</a> and the <a href="https://en.wikipedia.org/wiki/Coefficient_of_determination">coefficient of determination (<script type="math/tex">R^2</script>)</a> as sensible metrics of model fit.
RMSE is just the square root of the mean squared loss we discussed last time, where the square root adjusts units to match those of the outcome we’re trying to predict.
It’s useful when we already have a sense of absolute scale for “what’s good”.
The coefficient of determination, on the other hand, captures the fraction of variance in outcomes explained by the model, and is useful when we don’t have such a scale or are comparing across different problems.
We showed that this is the same as comparing the mean squared error (MSE) of the model to the MSE of a simple baseline where we always predict the average outcome.
Finally, we discussed the connection between <a href="https://en.wikipedia.org/wiki/Pearson_correlation_coefficient">Pearson’s correlation coefficient</a> and <script type="math/tex">R^2</script>.
See <a href="https://economictheoryblog.com/2014/11/05/proof/">here</a> for a proof that the latter is in fact the square of the former.</p>
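<p>Both metrics are short to compute. The Python sketch below writes the coefficient of determination as one minus the ratio of the model’s MSE to the MSE of always predicting the average outcome, as described above; the function names and numbers are made up for illustration.</p>

```python
import math

def mean_squared_error(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error, in the same units as the outcome."""
    return math.sqrt(mean_squared_error(actual, predicted))

def r_squared(actual, predicted):
    """1 - MSE(model) / MSE(baseline that always predicts the mean outcome)."""
    mean_outcome = sum(actual) / len(actual)
    baseline = [mean_outcome] * len(actual)
    return 1.0 - mean_squared_error(actual, predicted) / mean_squared_error(actual, baseline)

# Hypothetical outcomes and predictions.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 1.5, 3.5, 3.5]
print(rmse(actual, predicted))
print(r_squared(actual, predicted))
```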
<p>Applying both of these metrics to the pageview dataset, we saw that while there were systematic trends in typical viewing behavior by age and gender, there was still a surprisingly large amount of variation in individual activity for people of the same age and gender.</p>
<p>This led us to our second high-level topic, the question of complexity control: How complicated should we make our model?
We discussed the idea of generalization error, and how we’d like models that are both complex enough to account for the past and simple enough to predict the future.
Cross-validation is the most common approach to navigating this tradeoff, where we divide our data into a training set for fitting models, a validation set for comparing these different fits, and a test set that’s used once (and <em>only once</em>) to quote the expected future performance of the model we end up selecting.
We talked about <a href="https://www.youtube.com/watch?v=TIgfjmp-4BA">k-fold cross-validation</a> as a more statistically robust version of estimating generalization error.</p>
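<p>A rough Python sketch of k-fold cross-validation follows, comparing a constant model against a straight-line fit. The interleaved fold assignment, the helper names, and the data are assumptions for the example; in practice one would usually shuffle the data before assigning folds.</p>

```python
def k_fold_splits(n_items, k):
    """Yield (train_indices, validation_indices) for each of k interleaved folds."""
    indices = list(range(n_items))
    for fold in range(k):
        validation = indices[fold::k]
        held_out = set(validation)
        train = [i for i in indices if i not in held_out]
        yield train, validation

def fit_mean(xs, ys):
    """Baseline model that always predicts the average training outcome."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_line(xs, ys):
    """Least-squares line with one predictor and an intercept."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x

def cross_validated_mse(fit, xs, ys, k=5):
    """Average validation MSE across the k folds."""
    fold_errors = []
    for train, validation in k_fold_splits(len(xs), k):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        errors = [(ys[i] - model(xs[i])) ** 2 for i in validation]
        fold_errors.append(sum(errors) / len(errors))
    return sum(fold_errors) / k

# Hypothetical data: roughly y = 2x + 1 with small perturbations.
xs = [float(i) for i in range(10)]
ys = [1.1, 2.8, 5.05, 7.1, 8.9, 11.2, 12.9, 15.0, 17.1, 18.9]
print(cross_validated_mse(fit_mean, xs, ys))
print(cross_validated_mse(fit_line, xs, ys))
```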
<p>We also phrased this issue in terms of the <a href="http://scott.fortmann-roe.com/docs/BiasVariance.html">bias-variance tradeoff</a>.
Simple models are likely biased in that they systematically misrepresent the world, and would do so even with an infinite amount of data.
At the same time, estimating a simple model is a low variance procedure in that our results don’t change substantially when we fit it on different samples of data.
More flexible models, on the other hand, have little bias and can capture more complex patterns in the world.
The downside is that this flexibility also renders such models sensitive to noise, often leading to high variance, or drastically different results with different samples of the data.</p>
<p>We concluded lecture with a brief discussion of <a href="https://en.wikipedia.org/wiki/Regularization_%28mathematics%29">regularization</a> as a way of modifying loss functions to improve the generalization error of our models by explicitly balancing the fit to the training data with the “complexity” of the model.
The idea is that introducing some bias in our models is sometimes a good idea if the corresponding reduction in variance is enough to lower the mean squared error.</p>
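<p>As a small illustration of how a penalty term trades fit for simplicity, the Python sketch below solves a one-predictor, no-intercept ridge regression in closed form. The setup (squared-weight penalty, no intercept), the function name, and the data are assumptions chosen to keep the example short.</p>

```python
def ridge_slope(xs, ys, penalty):
    """Minimize sum((y - w*x)^2) + penalty * w^2 over the single weight w."""
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    return sum_xy / (sum_xx + penalty)

# Hypothetical data lying exactly on y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
for penalty in [0.0, 14.0, 1000.0]:
    print(penalty, ridge_slope(xs, ys, penalty))
```

<p>Larger penalties shrink the estimated weight toward zero, which is the bias that regularization deliberately introduces.</p>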
<p>Code from the lecture is up
<a href="https://github.com/jhofman/msd2017/blob/master/lectures/lecture_7/">on Github</a>.
Also see this interactive Shiny App to explore <a href="https://jmhmsr.shinyapps.io/regularization/">regularization</a>.</p>
<p>References:</p>
<ul>
<li>Chapter 2 of <a href="http://www-bcf.usc.edu/~gareth/ISL/">An Introduction to Statistical Learning</a> on the bias-variance tradeoff</li>
<li>Section 1.4 of <a href="http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/">Advanced Data Analysis from an Elementary Point of View</a> on the same, with a more detailed derivation
<!-- http://www.inf.ed.ac.uk/teaching/courses/mlsc/Notes/Lecture4/BiasVariance.pdf --></li>
<li>Chapter 5 of <a href="http://www-bcf.usc.edu/~gareth/ISL/">An Introduction to Statistical Learning</a> and 3 of <a href="http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/">Advanced Data Analysis from an Elementary Point of View</a> on resampling and cross-validation</li>
<li>Recent work on using differentially private mechanisms for <a href="https://research.googleblog.com/2015/08/the-reusable-holdout-preserving.html">reusing holdout sets</a></li>
</ul>
Fri, 03 Mar 2017 10:00:00 +0000
Lecture 6: Regression, Part 1
http://modelingsocialdata.org/lectures/2017/02/24/lecture-6-regression-1.html
<p>This was the first of two lectures on the theory and practice of regression.</p>
<center>
<iframe src="//www.slideshare.net/slideshow/embed_code/key/M3UPic6Yfewant" width="476" height="400" frameborder="0" marginwidth="0" marginheight="0" scrolling="no"></iframe>
</center>
<p>We started with a high-level overview of regression, which can be broadly defined as any analysis of how one continuous variable (the “outcome”) changes with others (the “inputs”, “predictors”, or “features”).
The goals of a regression analysis can vary, from describing the data at hand, to predicting new outcomes, to explaining the associations between outcomes and predictors.
This includes everything from looking at histograms and scatter plots to building statistical models.</p>
<p>We focused on the latter and discussed ordinary least squares regression.
First, we motivated this as an optimization problem and then connected squared loss minimization to the more general principle of maximum likelihood.
Then we discussed several ways to solve this optimization problem to estimate coefficients for a linear model, which are summarized in the table below.</p>
<table>
<thead>
<tr>
<th>Method</th>
<th style="text-align: center">Space</th>
<th style="text-align: center">Time</th>
<th>Comments</th>
</tr>
</thead>
<tbody>
<tr>
<td>Invert normal equations</td>
<td style="text-align: center"><script type="math/tex">N K + K^2</script></td>
<td style="text-align: center"><script type="math/tex">N K^2 + K^3</script></td>
<td>Good for medium-sized datasets with a relatively small number (e.g., hundreds or thousands) of features</td>
</tr>
<tr>
<td>Gradient descent</td>
<td style="text-align: center"><script type="math/tex">N K</script></td>
<td style="text-align: center"><script type="math/tex">NK</script> per step</td>
<td>Good for larger datasets that still fit in memory but have more (e.g., millions of) features; requires tuning the learning rate</td>
</tr>
<tr>
<td>Stochastic gradient descent</td>
<td style="text-align: center"><script type="math/tex">K</script></td>
<td style="text-align: center"><script type="math/tex">K</script> per step</td>
<td>Good for datasets that exceed available memory; more sensitive to learning rate schedule</td>
</tr>
</tbody>
</table>
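<p>To compare the first two rows of the table on a tiny problem, the Python sketch below fits a line with an intercept both by solving the normal equations directly and by batch gradient descent. The function names, learning rate, step count, and data are assumptions for illustration.</p>

```python
def solve_normal_equations(xs, ys):
    """Solve (X^T X) w = X^T y for a model with an intercept and one slope."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    det = n * sxx - sx * sx                  # determinant of the 2x2 matrix X^T X
    intercept = (sxx * sy - sx * sxy) / det
    slope = (n * sxy - sx * sy) / det
    return intercept, slope

def gradient_descent(xs, ys, learning_rate=0.05, steps=5000):
    """Minimize mean squared error; each step touches every data point once."""
    n = len(xs)
    intercept, slope = 0.0, 0.0
    for _ in range(steps):
        residuals = [intercept + slope * x - y for x, y in zip(xs, ys)]
        grad_intercept = 2.0 * sum(residuals) / n
        grad_slope = 2.0 * sum(r * x for r, x in zip(residuals, xs)) / n
        intercept -= learning_rate * grad_intercept
        slope -= learning_rate * grad_slope
    return intercept, slope

# Hypothetical data lying exactly on y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
print(solve_normal_equations(xs, ys))
print(gradient_descent(xs, ys))
```

<p>Both approaches should agree closely here; a learning rate that is too large would make the gradient descent iterates diverge, which is the tuning issue noted in the table.</p>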
<p>See also this interactive Shiny App to explore <a href="https://jmhmsr.shinyapps.io/modelfit/">manually fitting a simple model</a> and this notebook by Jongbin Jung with <a href="http://jakehofman.com/gd/">an animation of gradient descent</a>.</p>
<p>In the second half of class we looked at fitting linear models in R, with an application to understanding how internet browsing activity varies by age and gender.
See the <a href="https://github.com/jhofman/msd2017/blob/master/lectures/lecture_6/linear_models.ipynb">Jupyter notebook</a> up on Github for more details.
The main lesson here is that there’s more to modeling than just optimization, with many important steps along the way that range from collecting and specifying outcomes and predictors, to determining the form of a model, to assessing performance and interpreting results.</p>
<p>References:</p>
<ul>
<li>Chapters <a href="http://r4ds.had.co.nz/model-basics.html">23</a> and <a href="http://r4ds.had.co.nz/model-building.html">24</a> of <a href="http://r4ds.had.co.nz">R for Data Science</a></li>
<li>Chapter 3 of <a href="http://www-bcf.usc.edu/~gareth/ISL/">An Introduction to Statistical Learning</a></li>
<li>Chapters 1 and 2 of <a href="http://www.stat.cmu.edu/%7Ecshalizi/ADAfaEPoV/">Advanced Data Analysis from an Elementary Point of View</a></li>
<li><a href="http://www.cambridge.org/us/academic/subjects/statistics-probability/statistical-theory-and-methods/statistical-models-theory-and-practice-2nd-edition?format=PB">Statistical Models</a> by David Freedman</li>
<li><a href="https://us.sagepub.com/en-us/nam/regression-analysis/book226138">Regression Analysis</a> by Richard Berk</li>
</ul>
Fri, 24 Feb 2017 10:00:00 +0000
Lecture 5: Data Visualization
http://modelingsocialdata.org/lectures/2017/02/17/lecture-5-data-visualization.html
<p>We had a guest lecture from <a href="http://hci.stanford.edu/~cagatay//">Çağatay Demiralp</a> on data visualization.</p>
<center>
<iframe src="https://docs.google.com/viewer?srcid=0B-M9UEiE6KFAWmtvUjQta0RFNkk&pid=explorer&efh=false&a=v&chrome=false&embedded=true" width="476px" height="400px" frameborder="0" marginwidth="0" marginheight="0"></iframe>
</center>
<p>Çağatay discussed both the principles and practice of data visualization, starting with historical examples of <a href="https://en.wikipedia.org/wiki/1854_Broad_Street_cholera_outbreak">John Snow’s visualization</a> of cholera outbreaks and <a href="https://en.wikipedia.org/wiki/Florence_Nightingale#/media/File:Nightingale-mortality.jpg">Florence Nightingale’s infographic</a> on causes of death in the army.
He emphasized Stuart Card’s point that visualizations represent data in a way that <a href="https://books.google.com/books?id=wdh2gqWfQmgC&lpg=PP1&dq=Readings%20in%20Information%20Visualization%3A%20Using%20Vision%20to%20Think&pg=PA15#v=onepage&q=amplify%20cognition&f=false">amplifies cognition</a>, making it easier to see patterns in data, a point nicely illustrated by <a href="https://en.wikipedia.org/wiki/Anscombe%27s_quartet">Anscombe’s Quartet</a>.</p>
<p>We discussed the perceptual aspects of visualizations, including <a href="https://en.wikipedia.org/wiki/Stevens%27_power_law">Stevens’ Power Law</a>, and experiments by <a href="http://www.jstor.org/stable/2288400?seq=1#page_scan_tab_contents">Cleveland and McGill</a> showing that not all visual encodings are created equal, and that the best encoding depends on the type of data being visualized.
He closed with a discussion of different data visualization tools, including <a href="http://dl.acm.org/citation.cfm?id=22950">Mackinlay’s</a> expressiveness / effectiveness tradeoff and <a href="https://en.wikipedia.org/wiki/Leland_Wilkinson">Wilkinson’s</a> grammar of graphics.</p>
<p>In the second part of class we looked at <code class="highlighter-rouge">ggplot2</code>, Hadley Wickham’s popular implementation of Wilkinson’s grammar of graphics.
We focused on using <code class="highlighter-rouge">ggplot2</code> to effectively communicate information through visualizations.
Every visualization should convey a point, preferably one that can be summarized by a short sentence.
This <a href="https://github.com/jhofman/msd2017/blob/master/lectures/lecture_5/visualization_with_ggplot2.ipynb">Jupyter notebook</a> provides an intro to <code class="highlighter-rouge">ggplot2</code>, detailing how the choices we make in the visualization process affect the messages our plots and figures convey.</p>
<p>Readings and references:</p>
<ul>
<li>Chapters <a href="http://r4ds.had.co.nz/data-visualisation.html">3</a>, <a href="http://r4ds.had.co.nz/exploratory-data-analysis.html">7</a>, and <a href="http://r4ds.had.co.nz/graphics-for-communication.html">28</a> in <a href="http://r4ds.had.co.nz/">R for Data Science</a></li>
<li>RStudio’s <a href="https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf">ggplot2 cheatsheet</a></li>
<li>DataCamp’s <a href="https://campus.datacamp.com/courses/data-visualization-with-ggplot2-1/">Data Visualization with ggplot2</a> tutorial</li>
<li>Videos on <a href="http://varianceexplained.org/RData/lessons/lesson2/">Visualizing Data with ggplot2</a></li>
<li>Sean Anderson’s <a href="http://seananderson.ca/courses/12-ggplot2/ggplot2_slides_with_examples.pdf">ggplot2 slides</a> (<a href="http://github.com/seananderson/datawranglR">code</a>) for more examples</li>
<li>The <a href="http://docs.ggplot2.org/current/">official ggplot2 docs</a></li>
</ul>
Fri, 17 Feb 2017 10:10:00 +0000
Lecture 4: Counting at Scale
http://modelingsocialdata.org/lectures/2017/02/10/lecture-4-counting-at-scale.html
<p>In this lecture we discussed combining and reshaping data in R as well as counting at scale with MapReduce.</p>
<p>First we extended last week’s discussion of data manipulation in R by looking at the various joins available in <code class="highlighter-rouge">dplyr</code> (inner, left, full, and anti) for combining different tables.
Then we used the <code class="highlighter-rouge">tidyr</code> package to reshape data that comes in inconvenient formats (e.g., from long to wide with <code class="highlighter-rouge">spread</code>, or vice versa with <code class="highlighter-rouge">gather</code>).</p>
<p>See this <a href="https://github.com/jhofman/msd2017/blob/master/lectures/lecture_4/combine_and_reshape_in_r.ipynb">Jupyter notebook</a> for more details.
Additional readings include <a href="http://r4ds.had.co.nz/tidy-data.html">Chapter 12</a> of <a href="http://r4ds.had.co.nz/">R for Data Science</a> for <code class="highlighter-rouge">tidyr</code> and <a href="http://r4ds.had.co.nz/relational-data.html">Chapter 13</a> for joins.
There are also useful vignettes for <a href="https://cran.r-project.org/web/packages/dplyr/vignettes/two-table.html">two-table verbs</a> in <code class="highlighter-rouge">dplyr</code> and <a href="https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html">tidy data</a> with <code class="highlighter-rouge">tidyr</code>.</p>
<center>
<iframe src="//www.slideshare.net/slideshow/embed_code/key/7VTVGmJRVcQ1Ln" width="476" height="400" frameborder="0" marginwidth="0" marginheight="0" scrolling="no"></iframe>
</center>
<p>In the second half of class we talked about counting at scale with <a href="http://research.google.com/archive/mapreduce.html">MapReduce</a>.
At its core, MapReduce is a distributed system for solving the split/apply/combine problem at scale, essentially functioning as a distributed group-by operation.
The programmer implements a <code class="highlighter-rouge">map</code> function, which defines how records should be split into groups, and a <code class="highlighter-rouge">reduce</code> function, which defines what to compute within each group.
The system takes care of the rest of the complex engineering details, from distributed storage to fault tolerance, in a manner that makes the parallelism virtually transparent to the programmer.</p>
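<p>The division of labor between <code class="highlighter-rouge">map</code> and <code class="highlighter-rouge">reduce</code> can be imitated on a single machine. The Python sketch below is a toy, in-memory stand-in for the framework (no distribution or fault tolerance), and the names <code>run_mapreduce</code>, <code>map_words</code>, and <code>reduce_counts</code> are hypothetical; it is not the wordcount code from the course repository.</p>

```python
from collections import defaultdict

def map_words(line):
    """Map: emit a (word, 1) pair for every word in a line of text."""
    for word in line.lower().split():
        yield word, 1

def reduce_counts(word, counts):
    """Reduce: sum the values collected for one key."""
    return word, sum(counts)

def run_mapreduce(records, mapper, reducer):
    """Toy single-process driver: map every record, group by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)        # stands in for the shuffle phase
    return dict(reducer(key, values) for key, values in sorted(groups.items()))

# Hypothetical input lines.
lines = ["the quick brown fox", "the lazy dog", "The fox"]
word_counts = run_mapreduce(lines, map_words, reduce_counts)
print(word_counts)
```

<p>In a real deployment the grouping step is what the system performs across machines, while the programmer supplies only the two small functions.</p>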
<p><a href="http://hadoop.apache.org/">Hadoop</a> is a popular open source implementation of the MapReduce paradigm.
We discussed how <a href="https://hadoop.apache.org/docs/r1.2.1/streaming.html">Hadoop Streaming</a> can be used to scale existing code, and briefly looked at higher-level languages that abstract away some low-level MapReduce details from the programmer.
For instance, <a href="http://pig.apache.org">Pig</a> is a high-level language that converts sequences of common data analysis operations (e.g., filter, sort, join, group by, etc.) to chains of MapReduce jobs and executes these either locally or across a Hadoop cluster.
<a href="http://hive.apache.org">Hive</a> is similar, but follows the <a href="https://en.wikipedia.org/wiki/SQL">SQL</a> paradigm more closely.</p>
<p>See this <a href="https://vgc.poly.edu/~juliana/courses/cs6093/Readings/dean-cacm2008.pdf">CACM article</a> and <a href="http://infolab.stanford.edu/~ullman/mmds/ch2.pdf">Chapter 2</a> of <a href="http://mmds.org/">Mining Massive Data Sets</a> for more on MapReduce.
Michael Noll also has a nice <a href="http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/">tutorial</a>.
And code for the wordcount example we covered in class is on the <a href="https://github.com/jhofman/msd2017/tree/master/lectures/lecture_4">course Github page</a>.</p>
Fri, 10 Feb 2017 10:10:00 +0000Homework 1
http://modelingsocialdata.org/homework/2017/02/10/homework-1.html
http://modelingsocialdata.org/homework/2017/02/10/homework-1.html<p>The first homework assignment, <a href="https://github.com/jhofman/msd2017/tree/master/homework/homework_1">posted on Github</a>, is due on Thursday, February 23 by 11:59pm ET.</p>
<p>The first problem explores various counting techniques, the second involves some command line and R counting exercises, and the third looks at the impact of inventory size on customer satisfaction for the MovieLens data.
Details are in the README.md file for each problem.</p>
<p>Your code and a brief report with your results are to be submitted electronically in one zipped (or tarball-ed) file through the <a href="https://courseworks2.columbia.edu/courses/sis_course_id:APMAE4990_001_2017_1">CourseWorks</a> site.
All code should be contained in plain text files and should produce the exact results you provide in your writeup.
Code should be written in bash / R and should not have complex dependencies on non-standard libraries.
The report should simply present your answers to the questions in an organized format as either a plain text or pdf file.
All work should be your own and done individually.</p>
Fri, 10 Feb 2017 08:00:00 +0000Lecture 3: Computational complexity
http://modelingsocialdata.org/lectures/2017/02/03/lecture-3-computational-complexity.html
http://modelingsocialdata.org/lectures/2017/02/03/lecture-3-computational-complexity.html<p>We had a guest lecture from <a href="http://sidsen.org/">Sid Sen</a> on computational complexity and algorithm analysis.</p>
<p><img src="http://modelingsocialdata.org/img/runtime_table.png" alt="Algorithm runtime in seconds, from Kleinberg & Tardos" /></p>
<p>Sid discussed various ways of analyzing how long algorithms take to run, focusing on worst-case analysis.
We discussed <a href="https://www.khanacademy.org/computing/computer-science/algorithms/asymptotic-notation/a/asymptotic-notation">asymptotic notation</a> (<a href="https://www.khanacademy.org/computing/computer-science/algorithms/asymptotic-notation/a/big-o-notation">big-O</a> for upper bounds, <a href="https://www.khanacademy.org/computing/computer-science/algorithms/asymptotic-notation/a/big-big-omega-notation">big-omega</a> for lower bounds, and <a href="https://www.khanacademy.org/computing/computer-science/algorithms/asymptotic-notation/a/big-big-theta-notation">big-theta</a> for tight bounds).
The table above, from <a href="https://www.pearsonhighered.com/program/Kleinberg-Algorithm-Design/PGM319216.html">Algorithm Design</a> by Kleinberg and Tardos, shows how long we should expect different algorithms to run on modern hardware.
The key takeaway is that knowing how to match the right algorithm to your dataset is important.
For instance, when you’re dealing with millions of observations, only linear (or maybe <a href="https://en.wikipedia.org/wiki/Time_complexity#Linearithmic_time">linearithmic</a>) time algorithms are practical.</p>
<p>A few other references:</p>
<ul>
<li>A <a href="https://rob-bell.net/2009/06/a-beginners-guide-to-big-o-notation/">beginner’s guide</a> to big-O notation</li>
<li>Another <a href="https://www.interviewcake.com/article/python/big-o-notation-time-and-space-complexity">introduction to big-O</a></li>
<li>The <a href="http://bigocheatsheet.com/">big-O cheatsheet</a></li>
</ul>
<p>Sid finished his lecture by discussing how this applies to something as simple as taking the intersection of two lists, useful for <a href="https://en.wikipedia.org/wiki/Join_(SQL)">joining</a> different tables.
A naive approach of comparing all pairs of elements takes quadratic time.
It’s relatively easy to do much better by <a href="https://en.wikipedia.org/wiki/Sort-merge_join">sorting and merging</a> the two sets, reducing this to <code class="highlighter-rouge">n log(n)</code> time.
And if we’re willing to trade space for time, we can use a <a href="https://en.wikipedia.org/wiki/Hash_table">hash table</a> to get the job done in linear time, known as a <a href="https://en.wikipedia.org/wiki/Hash_join">hash join</a>.</p>
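<p>The three approaches can be sketched in a few lines of Python. This is an illustration rather than the code from the lecture, and it assumes neither list contains duplicates.</p>

```python
def intersect_naive(a, b):
    """Compare all pairs of elements: quadratic time."""
    return [x for x in a if any(x == y for y in b)]

def intersect_sort_merge(a, b):
    """Sort both lists, then walk them in step: n log(n) time overall."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def intersect_hash(a, b):
    """Build a hash table on one list and probe it with the other:
    linear expected time, at the cost of extra memory for the table."""
    seen = set(b)
    return [x for x in a if x in seen]

a = [5, 1, 9, 3, 7]
b = [3, 4, 5, 6, 7]
print(intersect_naive(a, b), intersect_sort_merge(a, b), intersect_hash(a, b))
```

<p>All three return the same elements; the naive and hash versions keep the order of the first list, while the sort-merge version returns them sorted.</p>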
<center>
<iframe src="//www.slideshare.net/slideshow/embed_code/key/ejmirP42ECxx3f" width="476" height="400" frameborder="0" marginwidth="0" marginheight="0" scrolling="no" style="border:1px solid #CCC; border-width:1px; margin-bottom:5px; max-width: 100%;" allowfullscreen=""> </iframe>
</center>
<p>We used the second half of lecture to discuss <a href="https://github.com/jhofman/msd2017/blob/master/lectures/lecture_3/intro_to_r.ipynb">data manipulation in R</a>, specifically focusing on using <a href="https://github.com/hadley/dplyr"><code class="highlighter-rouge">dplyr</code></a> from the <a href="http://tidyverse.org"><code class="highlighter-rouge">tidyverse</code></a> for a convenient implementation of the split / apply / combine framework.</p>
<p>We started this part of the lecture with a brief tour of the <a href="http://www.rstudio.com">RStudio</a> IDE.
We then focused on <a href="http://had.co.nz">Hadley Wickham’s</a> <code class="highlighter-rouge">dplyr</code> package (<a href="http://cran.r-project.org/web/packages/dplyr/index.html">CRAN</a>, <a href="https://github.com/hadley/dplyr">GitHub</a>), which provides a particularly nice implementation of the <a href="http://bit.ly/splitapplycombine">split/apply/combine</a> paradigm.
Source code is available on the <a href="https://github.com/jhofman/msd2017/tree/master/lectures/lecture_3">course GitHub page</a>.</p>
<p>There are <a href="https://pinboard.in/u:jhofman/t:r/t:tutorials/">lots of R resources</a> available on the web, but here are a few highlights:</p>
<ul>
<li><a href="http://tryr.codeschool.com">CodeSchool</a> and <a href="https://www.datacamp.com/courses/free-introduction-to-r">DataCamp</a> intro to R courses</li>
<li>More about <a href="http://www.r-tutor.com/r-introduction/basic-data-types">basic types</a> (numeric, character, logical, factor) in R</li>
<li>Vectors, lists, dataframes: a <a href="http://www.statmethods.net/input/datatypes.html">one page reference</a> and [more details]</li>
<li>Chapters <a href="http://r4ds.had.co.nz/introduction.html">1</a>, <a href="http://r4ds.had.co.nz/explore-intro.html">2</a>, and <a href="http://r4ds.had.co.nz/transform.html">5</a> of <a href="http://r4ds.had.co.nz">R for Data Science</a></li>
<li>DataCamp’s <a href="https://campus.datacamp.com/courses/dplyr-data-manipulation-r-tutorial">Data Manipulation in R</a> tutorial</li>
<li>The <a href="http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html">dplyr vignette</a></li>
<li>Sean Anderson’s <a href="http://seananderson.ca/2014/09/13/dplyr-intro.html">dplyr and pipes examples</a> (<a href="https://github.com/seananderson/dplyr-intro-2014">code</a> on github)</li>
<li>RStudio’s <a href="http://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf">data wrangling cheatsheet</a></li>
<li>Hadley Wickham’s <a href="http://adv-r.had.co.nz/Style.html">R style guide</a></li>
</ul>
Fri, 03 Feb 2017 00:00:00 +0000Lecture 2: Introduction to Counting
http://modelingsocialdata.org/lectures/2017/01/27/lecture-2-counting.html
http://modelingsocialdata.org/lectures/2017/01/27/lecture-2-counting.html<p>Counting is surprisingly useful for understanding and summarizing social data. The key is figuring out what to count and how to count it efficiently.</p>
<center>
<iframe src="//www.slideshare.net/slideshow/embed_code/key/3O721xJmxzHLuh" width="476" height="400" frameborder="0" marginwidth="0" marginheight="0" scrolling="no" style="border:1px solid #CCC; border-width:1px; margin-bottom:5px; max-width: 100%;" allowfullscreen=""> </iframe>
</center>
<p>While it’s a seemingly simple concept, counting can be quite challenging in practice, especially when dealing with large, multi-dimensional data.</p>
<p>We discussed the <a href="http://bit.ly/splitapplycombine">split/apply/combine</a> paradigm for counting and applied it to several examples from <a href="http://5harad.com/papers/long_tail.pdf">The Anatomy of the Long Tail</a>.
We also looked at alternative models for counting that trade off flexibility for scalability, such as <a href="http://en.wikipedia.org/wiki/Streaming_algorithm">streaming algorithms</a>.
Streaming allows us to compute statistics such as the mean or <a href="http://www.johndcook.com/blog/standard_deviation/">variance</a> without having to read all of the data into memory first.
We summarized these approaches and compared the types of statistics that can be computed under various conditions.</p>
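<p>As a rough illustration of the streaming idea, the sketch below computes the mean and sample variance in a single pass using a standard running update (often attributed to Welford). It keeps only three numbers in memory regardless of how many values arrive. The function name and the example values are made up for illustration.</p>

```python
def streaming_mean_variance(values):
    """One-pass mean and sample variance over any iterable of numbers."""
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    variance = m2 / (n - 1) if n > 1 else float("nan")
    return n, mean, variance

# The input can be a generator, e.g. numbers parsed line by line from a file.
stream = iter([2, 4, 4, 4, 5, 5, 7, 9])
n, mean, variance = streaming_mean_variance(stream)
print(n, mean, variance)
```

<p>Statistics such as the median do not admit this kind of exact constant-memory update, which is one example of the flexibility given up in exchange for scalability.</p>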
<p>We concluded with more work on the command line, including some simple counting and exploration of the <a href="https://www.citibikenyc.com/system-data">CitiBike trip data</a>.
Slides and code including an “Introduction to the command line” <a href="https://github.com/jhofman/msd2017/blob/master/lectures/lecture_2/intro_command_line.ipynb">notebook</a> are available on the <a href="https://github.com/jhofman/msd2017/tree/master/lectures/lecture_2/">course github page</a>.</p>
<p>Additional command line references can be found in the <a href="/homework/2017/01/20/installing-tools.html">installing tools</a> post.</p>
Fri, 27 Jan 2017 00:00:00 +0000Lecture 1: Overview
http://modelingsocialdata.org/lectures/2017/01/20/lecture-1-overview.html
http://modelingsocialdata.org/lectures/2017/01/20/lecture-1-overview.html<p>We used our first lecture to look at case studies in four main areas: exploratory data analysis, classification, regression, and working with network data.</p>
<center>
<iframe src="//www.slideshare.net/slideshow/embed_code/key/3OAsEKMJjyH2me" width="476" height="400" frameborder="0" marginwidth="0" marginheight="0" scrolling="no" style="border:1px solid #CCC; border-width:1px; margin-bottom:5px; max-width: 100%;" allowfullscreen=""> </iframe>
</center>
<p>We discussed a few examples, including using aggregate search activity to <a href="http://www.pnas.org/content/107/41/17486.full.pdf">predict consumer behavior</a>, exploring browsing logs to understand how Internet usage <a href="http://www.aaai.org/ocs/index.php/ICWSM/ICWSM12/paper/viewFile/4660/4975">varies across demographic groups</a>, and analyzing the structure of information cascades to understand <a href="https://5harad.com/papers/twiral.pdf">how content spreads online</a>.</p>
<p>During this discussion, we touched on how easy it is to find <a href="http://www.tylervigen.com">spurious correlations</a>, <a href="http://hunch.net/?p=22">cheat at prediction</a>, and <a href="http://www.amazon.com/gp/product/0393310728/">lie with statistics</a>.</p>
Fri, 20 Jan 2017 00:00:00 +0000Installing tools
http://modelingsocialdata.org/homework/2017/01/20/installing-tools.html
http://modelingsocialdata.org/homework/2017/01/20/installing-tools.html<p>This class will involve a good deal of coding, for which you will need some basic tools. Please make sure to set up the following tools after the first day of class.</p>
<h3 id="an-interactive-bash-shell">An interactive <a href="http://www.gnu.org/software/bash/">bash</a> shell</h3>
<p>This will give you the ability to interact with your filesystem via the command line instead of a GUI such as Windows Explorer or Mac Finder. We will also use bash to automate acquiring and cleaning data sets.</p>
<p>If you use Windows, you can try the <a href="http://www.howtogeek.com/249966/how-to-install-and-use-the-linux-bash-shell-on-windows-10/">builtin bash/Ubuntu</a> shell on Windows 10 or you can <a href="https://cygwin.com/install.html">install Cygwin</a> which includes bash and a terminal application by default. Mac OS X includes a bash shell by default, and a terminal application in <code class="highlighter-rouge">/Applications/Utilities</code>. Linux also includes a working shell and terminal.</p>
<p>Verify that your environment is properly configured by typing the commands shown after the <code class="highlighter-rouge">#</code> symbol. You should see something similar (although not necessarily identical) to the following:</p>
<div class="highlighter-rouge"><pre class="highlight"><code># echo $SHELL
/bin/bash
# grep --version
grep (BSD grep) 2.5.1-FreeBSD
# cut
usage: cut -b list [-n] [file ...]
cut -c list [file ...]
cut -f list [-s] [-d delim] [file ...]
</code></pre>
</div>
<p>If you’re new to the command line, see Codecademy’s <a href="https://www.codecademy.com/courses/learn-the-command-line/lessons/navigation/exercises/your-first-command?action=lesson_resume">interactive tutorial</a>, this <a href="https://learnpythonthehardway.org/book/appendixa.html">crash course</a>, and Software Carpentry’s <a href="http://swcarpentry.github.io/shell-novice/">guide</a>.
Lifehacker’s <a href="http://lifehacker.com/5633909/who-needs-a-mouse-learn-to-use-the-command-line-for-almost-anything">command line primer</a> is also decent.</p>
<p>O’Reilly’s <a href="http://shop.oreilly.com/product/9780596005955.do">Classic Shell Scripting</a> book is a more complete reference.</p>
<h3 id="a-git-client">A <a href="http://git-scm.com">Git</a> client</h3>
<p>Git is a version control system that allows you to track modifications to files and code over time. It also facilitates collaborations so that multiple people can share and edit the same code base.</p>
<p>If you are on Windows you can install <a href="https://windows.github.com">Github for Windows</a> which provides both the command line tool for git and a graphical user interface. Alternatively, you can install git as an optional package under Cygwin. We recommend the Github application, as it will be easier to interface with Github using it. Likewise, modern versions of Mac OS X have a command line git client installed by default, but the <a href="https://mac.github.com">Github for Mac</a> tool is a recommended addition. Linux users can install git with the appropriate package manager (e.g., <code class="highlighter-rouge">yum install git</code> on RedHat or <code class="highlighter-rouge">apt-get install git</code>), and there are a number of different <a href="http://unix.stackexchange.com/questions/144100/is-there-a-usable-gui-front-end-to-git-on-linux">git GUIs for Linux</a>.</p>
<p>Complete this relatively brief <a href="https://www.codeschool.com/courses/try-git">interactive tour of git</a>. See this <a href="http://rogerdudler.github.io/git-guide/">one page guide</a> for explanations of the usual git workflow and most common commands, or <a href="http://kbroman.org/github_tutorial/">here</a> for a more verbose guide. Github also has an <a href="https://www.youtube.com/watch?v=U8GBXvdmHT4">introductory video</a>, some <a href="https://services.github.com/training/">training courses</a>, and a handy <a href="https://services.github.com/resources/">cheatsheet</a>.</p>
<h3 id="a-github-account">A <a href="http://github.com">Github</a> account</h3>
<p>Github is a platform that facilitates collaboration on projects that use git. You can use it to host projects, publish them to the web, and share them with other people. <a href="https://help.github.com/articles/signing-up-for-a-new-github-account/">Create a free account</a> if you don’t already have one.</p>
<p>Once you have an account, clone the <a href="https://github.com/jhofman/msd2017">course repository</a> using your local git client. This is most easily done on the command line as follows:</p>
<div class="highlighter-rouge"><pre class="highlight"><code># git clone https://github.com/jhofman/msd2017.git
Cloning into 'msd2017'...
remote: Counting objects: 145, done.
remote: Compressing objects: 100% (98/98), done.
remote: Total 145 (delta 40), reused 137 (delta 37)
Receiving objects: 100% (145/145), 454.90 KiB | 594.00 KiB/s, done.
Resolving deltas: 100% (40/40), done.
Checking connectivity... done.
</code></pre>
</div>
<p>When this is complete, verify that you have a local directory called <code class="highlighter-rouge">msd2017</code> containing a <code class="highlighter-rouge">README.md</code> file.</p>
<h3 id="r-and-rstudio">R and RStudio</h3>
<p>R is a useful programming language for exploratory data analysis, data visualization, and statistical modeling. RStudio is a popular integrated development environment (IDE) for working in R.</p>
<p>First, download and install R from a <a href="https://cloud.r-project.org/">CRAN mirror</a>. Then download RStudio from <a href="https://www.rstudio.com/products/rstudio/download/">here</a>. Finally, install and load some important packages as follows:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>install.packages('tidyverse')
library(tidyverse)
</code></pre>
</div>
<p>If you’re new to R, see the <a href="http://tryr.codeschool.com/">Code School</a> and <a href="http://datacamp.com/courses/free-introduction-to-r">DataCamp</a> online tutorials.</p>
<p>We will discuss all of these tools in more detail in class.</p>
Fri, 20 Jan 2017 00:00:00 +0000Lecture 13: Causality and Experiments
http://modelingsocialdata.org/lectures/2015/04/24/experiments.html
http://modelingsocialdata.org/lectures/2015/04/24/experiments.html<p>Most of what we’ve discussed in this class has focused on observational data—data obtained without direct intervention from or manipulation by those studying it.
We can learn a lot from observational data and use it to find interesting relationships or generate hypotheses, but it has its limits.
The most serious of these limits arise when we start thinking about <em>causality</em>: what might look like a causal relationship might in fact be due to <a href="http://www.tylervigen.com/spurious-correlations">spurious correlations</a> or confounding factors.
This is often summarized by catchy phrases such as <a href="http://freakonomics.com/2009/06/30/so-long-and-thanks-for-all-the-f-tests/">“correlation is not causation”</a> or <a href="http://freakonomics.com/2009/06/30/so-long-and-thanks-for-all-the-f-tests/">“no causation without manipulation”</a>.</p>
<p>For a better understanding of why it’s difficult to make causal claims from observational data, let’s say you’re studying the effectiveness of Alcoholics Anonymous (AA) programs and you manage to find a dataset that provides details on the consumption habits of known alcoholics.
For concreteness, imagine the dataset consists of an indicator for each person that says whether they have participated in AA along with a measure of how long they’ve been sober.
You might be tempted to take this data, look at the distribution of sobriety for the AA members and the non-members, and use this to say something about the effect of the program.</p>
<p>Let’s say you do so and find that AA members tend to stay sober longer than non-members.
Unfortunately this simple observational estimate might drastically overstate the effectiveness of the program, owing to an issue known as <em>selection bias</em>.
For instance, it could be the case that the very people who are more likely to stay sober—those who will make a real effort to do so, regardless of whether they’re in AA or not—are also more likely to join AA.
If this is true, then simply observing that AA members tend to stay sober longer than non-members might tell you more about the type of people who join AA than about the effectiveness of the program itself.
There are even more drastic effects such as <a href="http://en.wikipedia.org/wiki/Simpson's_paradox">Simpson’s paradox</a> where failing to account for confounding factors leads to a directionally incorrect estimate of an effect: what appears to be a positive correlation without adjusting for possible confounds can in fact become a negative one when all available information is accounted for.</p>
<p>The question you’d really like to answer is this: if you cloned each person and sent one copy of that person through the AA program, but not the other, what would the resulting difference in sobriety be?
Short of being able to do this, we could ask a slightly different question: if we had two groups of people who were nearly identical in every way and we sent one group through AA, but not the other, how would the sobriety of the two groups differ?
This is precisely the idea behind <a href="http://en.wikipedia.org/wiki/Randomized_experiment">randomized experiments</a>, such as <a href="http://en.wikipedia.org/wiki/Clinical_trial">clinical trials</a> in medicine and <a href="http://en.wikipedia.org/wiki/A/B_testing">A/B testing</a> for online platforms.
Randomization is key here, as it provides a way of creating two groups that are as similar as possible prior to the treatment (e.g., AA attendance) being administered: if people are randomly assigned to groups, then there shouldn’t be any systematic difference between the two groups.
Since the only difference between the groups is that one gets treated and the other doesn’t, we can ascribe differences in the outcome to the treatment.</p>
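<p>A small simulation can make the contrast between observational and randomized estimates concrete. Everything in the sketch below is invented for illustration: a hidden "motivation" score drives both sobriety and the decision to join, the true effect of the program is fixed at 2 months, and none of the numbers come from real data.</p>

```python
import random

def mean(xs):
    return sum(xs) / len(xs)

def simulate(n=20000, true_effect=2.0, seed=0):
    """Compare an observational estimate with a randomized one (toy model)."""
    rng = random.Random(seed)
    obs_treated, obs_control = [], []
    rand_treated, rand_control = [], []
    for _ in range(n):
        motivation = rng.gauss(0, 1)  # hidden confounder
        baseline = 24 + 3 * motivation + rng.gauss(0, 1)  # months sober without AA

        # Observational world: motivated people are more likely to join.
        joins = rng.random() < (0.8 if motivation > 0 else 0.2)
        if joins:
            obs_treated.append(baseline + true_effect)
        else:
            obs_control.append(baseline)

        # Experimental world: a coin flip decides who is treated.
        assigned = rng.random() < 0.5
        if assigned:
            rand_treated.append(baseline + true_effect)
        else:
            rand_control.append(baseline)

    obs_estimate = mean(obs_treated) - mean(obs_control)
    rand_estimate = mean(rand_treated) - mean(rand_control)
    return obs_estimate, rand_estimate

obs_estimate, rand_estimate = simulate()
print(obs_estimate, rand_estimate)
```

<p>Under these assumptions the observational difference is inflated by the gap in motivation between joiners and non-joiners, while the randomized difference should land close to the true effect of 2.</p>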
<p>After reviewing the high-level ideas behind experiments, we discussed A/B testing in detail.
Similar to last week’s review of statistical inference, we used simulations to look at point estimates, confidence intervals, and hypothesis testing for experiments.
See the <a href="http://rpubs.com/jhofman/ab_testing">Rmarkdown notebook</a> and <a href="http://glinden.blogspot.com/2007/06/ab-testing-at-amazon-and-microsoft.html">course site</a> for more notes.
In particular, there are a number of <a href="http://glinden.blogspot.com/2007/06/ab-testing-at-amazon-and-microsoft.html">practical guides</a> for A/B testing that deal with some of the common issues and pitfalls that arise.</p>
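<p>For reference, here is a minimal sketch of a point estimate and an approximate 95% confidence interval for the difference in conversion rates between two variants. Note that it uses the textbook normal approximation rather than the simulation-based approach taken in class, and the function name and the counts in the example are hypothetical.</p>

```python
import math

def ab_test(conversions_a, n_a, conversions_b, n_b, z=1.96):
    """Difference in conversion rates (B minus A) with a normal-approximation
    confidence interval; z=1.96 corresponds to roughly 95% coverage."""
    p_a = conversions_a / n_a
    p_b = conversions_b / n_b
    diff = p_b - p_a
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    return diff, (diff - z * se, diff + z * se)

# Hypothetical experiment: 100 of 1000 users convert on A, 130 of 1000 on B.
diff, (lower, upper) = ab_test(100, 1000, 130, 1000)
print(diff, lower, upper)
```

<p>If the interval excludes zero, the corresponding two-sided test would reject the hypothesis of no difference at roughly the 5% level.</p>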
<p>Some additional references:</p>
<ul>
<li>
<p>Three related textbooks: <a href="http://www.cambridge.org/us/academic/subjects/politics-international-relations/research-methods-politics/natural-experiments-social-sciences-design-based-approach">Natural Experiments in the Social Sciences</a>, <a href="http://isps.yale.edu/FEDAI">Field Experiments: Design, Analysis, and Interpretation</a>, <a href="http://www.mostlyharmlesseconometrics.com">Mostly Harmless Econometrics</a></p>
</li>
<li>
<p>Matt Blackwell’s lecture notes on <a href="http://www.mattblackwell.org/files/teaching/s03-potential.pdf">causality and potential outcomes</a> as well as <a href="http://www.mattblackwell.org/files/teaching/s04-experiments.pdf">randomized experiments</a></p>
</li>
<li>
<p>Some notes on <a href="http://andrewgelman.com/2007/12/08/causal_inferenc_2/">causal inference</a> from Andrew Gelman</p>
</li>
<li>
<p><a href="http://developers.lyst.com/data/2014/05/10/bayesian-ab-testing/">Bayesian A/B Testing</a></p>
</li>
<li>
<p>The <a href="http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2367103">difficulty</a> of <a href="http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2338892">measuring the effects of advertising</a></p>
</li>
<li>
<p>Why to <a href="http://www.vox.com/2015/3/23/8264355/research-study-hype">be skeptical</a> of most medical studies</p>
</li>
<li>
<p>A recent explanation in the New York Times of <a href="http://www.nytimes.com/2015/04/07/upshot/alcoholics-anonymous-and-the-challenge-of-evidence-based-medicine.html">instrumental variable techniques</a> for dealing with non-compliance in randomized trials</p>
</li>
</ul>
Fri, 24 Apr 2015 12:00:00 +0000Lecture 12: Networks in MapReduce
http://modelingsocialdata.org/lectures/2015/04/17/networks-in-mapreduce.html
http://modelingsocialdata.org/lectures/2015/04/17/networks-in-mapreduce.html<p>In the first part of lecture we focused on adapting the network algorithms discussed in last week’s class to the MapReduce framework, to handle larger datasets.
When working with a small to medium sized network that fits in memory on a single machine, it’s easy to take for granted that you have instant, random access to the entire network.
Parallel frameworks such as MapReduce allow you to work with larger networks, but they don’t share this feature, which complicates things.
This discussion was based on <a href="http://www.slideshare.net/jakehofman/largescale-social-media-analysis-with-hadoop/70">slides</a> and <a href="https://github.com/jhofman/icwsm2010_tutorial">code</a> from a 2010 tutorial, which gives details for computing degree distributions, clustering coefficients, and shortest paths in MapReduce.</p>
<p>Take, for instance, breadth-first search (BFS).
On a single machine we simply maintain a boundary of discovered nodes and increment the distance to any undiscovered neighbors of the boundary.
In MapReduce, the boundary is likely to be distributed across many different machines, and its neighbors are themselves probably on entirely different machines.
As a result, we have to do a lot of redundant work to implement BFS in MapReduce.
Intuitively, the algorithm works by expanding the boundary one step for every round of MapReduce that’s completed.
In each round, <em>every</em> discovered node sends <em>all</em> of its neighbors an update for its possible distance in the map phase, and each undiscovered node takes on the minimum of all distance updates it receives in the reduce phase.
This allows us to scale BFS to networks that are too large to fit in memory, but can require many rounds of MapReduce for networks with large diameters (e.g., having long chains), which can be expensive in practice.</p>
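<p>The round structure can be sketched on a single machine as follows. This is a simplified illustration, not the code from the tutorial: the driver loop and function names are hypothetical, the in-memory dictionary plays the role of the shuffle, and the graph is assumed to be undirected with every node present as a key in the adjacency list.</p>

```python
from collections import defaultdict

INF = float("inf")

def bfs_map(node, distance, neighbors):
    """Each node re-emits its own distance; discovered nodes also propose
    distance + 1 to all of their neighbors."""
    yield node, distance
    if distance < INF:
        for neighbor in neighbors:
            yield neighbor, distance + 1

def bfs_reduce(node, proposed_distances):
    """Each node keeps the smallest distance proposed to it."""
    return node, min(proposed_distances)

def bfs_mapreduce(adjacency, source):
    """Run rounds of map and reduce until no distance changes."""
    distances = {node: INF for node in adjacency}
    distances[source] = 0
    rounds = 0
    while True:
        rounds += 1
        shuffled = defaultdict(list)
        for node, neighbors in adjacency.items():
            for key, value in bfs_map(node, distances[node], neighbors):
                shuffled[key].append(value)
        updated = dict(bfs_reduce(n, values) for n, values in shuffled.items())
        if updated == distances:
            return distances, rounds
        distances = updated

# A chain a - b - c - d, the worst case for the number of rounds.
chain = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
distances, rounds = bfs_mapreduce(chain, "a")
print(distances, rounds)
```

<p>On this chain the sketch needs one round per hop plus a final round to detect that nothing changed, and already-settled nodes keep re-sending updates in every round, which is the redundant work mentioned above.</p>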
<p>Similar challenges arise when porting node-level clustering coefficient or mutual friends calculations from a single machine to MapReduce.
Here the issue is not the number of rounds, but the large amount of intermediate data generated in the shuffle during a single round.
Both of these algorithms rely on generating two-hop paths from adjacency lists.
In MapReduce, this means that every node must effectively tell <em>each</em> of its neighbors about <em>all</em> of its other neighbors, creating a quadratic increase in the intermediate data handled by the shuffle.
This dramatically slows down the shuffle, and introduces what has been termed the <a href="http://theory.stanford.edu/~sergei/papers/www11-triangles.pdf">Curse of the Last Reducer</a>, where most tasks execute quickly, but the few unlucky machines that deal with high-degree nodes block progression of the computation.
(This paper contains several clever improvements on the simple two-hop path generation approach.)</p>
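<p>The simple two-hop approach, and why it is expensive, can be seen in a short sketch. The code below is illustrative only (the function names and the toy graph are hypothetical): each node emits every pair of its neighbors, so a node of degree d produces d(d-1)/2 intermediate records.</p>

```python
from collections import defaultdict
from itertools import combinations

def two_hop_map(node, neighbors):
    """Emit every pair of this node's neighbors, keyed by the pair.
    The value is the node in the middle of the two-hop path."""
    for u, v in combinations(sorted(neighbors), 2):
        yield (u, v), node

def mutual_friends(adjacency):
    """Group two-hop paths by endpoint pair to list each pair's mutual friends."""
    paths = defaultdict(list)
    for node, neighbors in adjacency.items():
        for pair, center in two_hop_map(node, neighbors):
            paths[pair].append(center)  # stands in for the shuffle
    return paths

# One high-degree hub, plus an edge between two of its neighbors.
graph = {
    "hub": ["a", "b", "c", "d"],
    "a": ["hub", "b"],
    "b": ["hub", "a"],
    "c": ["hub"],
    "d": ["hub"],
}
paths = mutual_friends(graph)
emitted = {node: len(list(two_hop_map(node, nbrs))) for node, nbrs in graph.items()}
print(dict(paths), emitted)
```

<p>Even in this tiny graph the hub accounts for most of the intermediate records, and the imbalance grows quadratically with degree, which is what leaves a few reducers with most of the work.</p>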
<p>All of this serves to point out that while many computations are possible in MapReduce, not all are advisable.
Simply put, in some cases MapReduce just isn’t the right framework, and for parallel network computations this can often be the case.
There is theoretical work that provides a <a href="http://theory.stanford.edu/~sergei/papers/soda10-mrc.pdf">Model of MapReduce</a> to formalize these ideas.
There are also alternative frameworks for parallel network algorithms, such as Google’s <a href="http://kowshik.github.io/JPregel/pregel_paper.pdf">Pregel</a> and Apache’s open source clone, <a href="http://giraph.apache.org">Giraph</a>.
Another quite effective option is simply to use a high-memory machine for many of these tasks.</p>
<p>In the second half of lecture, we looked at a few prediction problems on networks for which you might use features generated by these network algorithms.
This includes <a href="https://5harad.com/papers/birds.pdf">predicting demographics</a> and <a href="http://www.wwwconference.org/proceedings/www2010/www/p301.pdf">individual behavior</a> by exploiting homophily—or the tendency for people to form ties to others who are similar to them—in social networks.</p>
<p>In preparation for discussing experiments next week, we concluded with a brief overview of statistical inference.
We reviewed point estimates, confidence intervals, and hypothesis testing, all through simulations.
See the <a href="http://rpubs.com/jhofman/statistical_inference">Rmarkdown notebook</a> for simulation-based approaches to statistical inference, also hosted on the <a href="https://github.com/jhofman/msd2015/tree/master/lectures/lecture_12">course GitHub page</a>, as well as Yakir’s excellent (and free) online <a href="http://pluto.huji.ac.il/~msby/StatThink/">Introduction to Statistical Thinking (With R, Without Calculus)</a>.</p>
<!--
BFS computes shortest path: http://www.cs.toronto.edu/~krueger/cscB63h/lectures/BFS.pdf
BFS runtime and correctness: http://www.cse.ust.hk/faculty/golin/COMP271Sp03/Notes/MyL06.ps
[Facebook at scale](http://arxiv.org/abs/1111.4503)
-->
Fri, 17 Apr 2015 12:00:00 +0000