http://modelingsocialdata.org/
Thu, 11 May 2017 19:09:40 +0000
Lecture 12: Causality & Experiments, Part 2
http://modelingsocialdata.org/lectures/2017/04/21/lecture-12-causality-and-experiments-2.html
<p>This was our second lecture on causality and experimentation, in which we discussed statistical inference and reproducibility for randomized experiments as well as the design and analysis of natural experiments.</p>
<center>
<iframe src="//www.slideshare.net/slideshow/embed_code/key/Evp79egevHRmoJ" width="476" height="400" frameborder="0" marginwidth="0" marginheight="0" scrolling="no"></iframe>
</center>
<p>The previous lecture provided a high-level overview of experimentation, focusing on randomized experiments as the gold standard for causal inference.
In the first part of this lecture we discussed how to reliably design and analyze randomized experiments.
We began with a review of statistical inference, following <a href="http://pluto.huji.ac.il/%7Emsby/StatThink/index.html">Yakir’s approach</a> of using simulations to look at sampling distributions, point estimates, confidence intervals, hypothesis testing, and power calculations.
The basic idea is that you can circumvent a good deal of theory and simulate things directly by repeatedly sampling data to arrive at the usual results for inference and testing.
This has the downside that it’s computationally expensive, but the upside that it presents statistics in a clear, concrete, and practical manner.
See <a href="https://github.com/jhofman/msd2017/tree/master/lectures/lecture_12">here</a> for the code, and the visually appealing <a href="http://students.brown.edu/seeing-theory/">Seeing Theory</a> site for more.</p>
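<p>As a rough illustration of the simulation-based approach described above, here is a minimal sketch that repeatedly draws samples from a known population and looks at the resulting distribution of sample means. It is not the course code (which is in R); the function names, the normal population, and all numeric settings are assumptions chosen for illustration.</p>

```python
import random
import statistics

def simulate_sample_means(population_mean, population_sd, sample_size, num_samples, seed=0):
    """Draw many samples from a normal population and return the mean of each one."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.gauss(population_mean, population_sd) for _ in range(sample_size))
        for _ in range(num_samples)
    ]

def central_interval(values, coverage=0.95):
    """Return the interval holding (approximately) the central `coverage` fraction of values."""
    ordered = sorted(values)
    tail = (1 - coverage) / 2
    lower = ordered[int(tail * len(ordered))]
    upper = ordered[int((1 - tail) * len(ordered)) - 1]
    return lower, upper

# Hypothetical population: mean 10, standard deviation 2, samples of size 100.
means = simulate_sample_means(population_mean=10.0, population_sd=2.0,
                              sample_size=100, num_samples=2000)
low, high = central_interval(means)

# The spread of the simulated means approximates the standard error, sd / sqrt(n).
print(f"simulated standard error: {statistics.stdev(means):.3f} (theory: {2.0 / 100 ** 0.5:.3f})")
print(f"central 95% of sample means: [{low:.2f}, {high:.2f}]")
```

<p>The same loop can be reused for hypothesis tests or power calculations by changing what is computed on each simulated sample.</p>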
<p>Then we discussed several ways in which randomized experiments can go wrong, including <a href="https://en.wikipedia.org/wiki/Statistical_power">small sample sizes</a>, <a href="https://en.wikipedia.org/wiki/Multiple_comparisons_problem">multiple hypothesis testing</a>, <a href="https://en.wikipedia.org/wiki/Post_hoc_analysis">post-hoc data analysis</a> and <a href="https://en.wikipedia.org/wiki/Data_dredging">p-hacking</a>.
The combination of these effects has led to a <a href="https://en.wikipedia.org/wiki/Replication_crisis">replication crisis</a> in the social sciences, wherein researchers have found that a number of published experimental findings have failed to replicate in followup studies.
Following Felix Schönbrodt’s excellent <a href="http://www.nicebread.de/whats-the-probability-that-a-significant-p-value-indicates-a-true-effect/">blog post</a>, we discussed how underpowered studies lead to false discoveries.
While these issues are complex, there are a few best practices (e.g., running pilot studies followed by <a href="https://aspredicted.org">pre-registration</a> of high-powered, large-scale experiments) that can help mitigate these concerns.
<a href="http://www.sciencemag.org/careers/2015/12/register-your-study-new-publication-option">Registered reports</a> are a particularly attractive solution, wherein researchers write up and submit an experimental study for peer review <em>before</em> the study is conducted.
Reviewers make an acceptance decision at this point based on the merit of the study, and, if accepted, it is published regardless of the results.
Finally, we briefly touched on ethical and practical concerns of running randomized experiments, looking at <a href="http://www.pnas.org/content/111/24/8788.full.pdf">Facebook’s study of emotional contagion</a> and Kohavi et al.’s <a href="http://www.exp-platform.com/">practical tips for running A/B tests</a>.</p>
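<p>To make the link between underpowered studies and false discoveries concrete, the sketch below computes the probability that a statistically significant result reflects a true effect, given the share of tested hypotheses that are true, the power of the study, and the significance threshold. The function name and the 10% prior are illustrative assumptions, not figures from the lecture.</p>

```python
def positive_predictive_value(prior, power, alpha=0.05):
    """Probability that a significant result reflects a true effect.

    prior: fraction of tested hypotheses that are actually true (assumed)
    power: probability of detecting a true effect
    alpha: false positive rate of the test
    """
    true_positives = power * prior
    false_positives = alpha * (1 - prior)
    return true_positives / (true_positives + false_positives)

# Hypothetical setting: 1 in 10 tested hypotheses is true.
for power in (0.2, 0.5, 0.8):
    ppv = positive_predictive_value(prior=0.1, power=power)
    print(f"power = {power:.1f}: P(true effect | significant) = {ppv:.2f}")
```

<p>Under these assumed numbers, a significant finding from a study with 20% power is more likely to be false than true, while the same finding at 80% power is true roughly two times out of three.</p>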
<p>In the second part of lecture we moved on to natural experiments.
We followed <a href="http://www.thaddunning.com/wp-content/uploads/2009/12/Dunning_IEPS_InstrumentalVariables2.pdf">Dunning’s treatment of instrumental variables</a> (IV) by looking at randomized experiments with non-compliance, where there’s a difference between assignment to treatment (e.g., whether you’re told to take a drug) versus receipt of treatment (e.g., whether you actually take it).
The basic idea is that we can estimate two separate quantities: the effect of being assigned a treatment and the probability of actually complying with that assignment.
Dividing the former by the latter provides an estimate of the causal effect of actually receiving the treatment.
Furthermore, we can extend this analysis to situations in which nature provides the randomization instead of a researcher flipping a coin, in which case the source of randomness is referred to as an “instrument” that systematically shifts the probability of being treated.
Classic examples include lotteries or weather events.
We briefly looked at an example of the latter in a recent paper that uses random variations in weather to study <a href="https://www.nature.com/articles/ncomms14753">peer effects of exercise</a> in social networks.
We concluded with a discussion about the benefits and limitations of traditional approaches to finding and arguing for valid instruments, and looked at an example of <a href="http://jakehofman.com/inprint/amazonrecs.pdf">data-driven approaches to finding instruments</a>.</p>
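<p>The ratio described above (the effect of assignment divided by the rate of compliance) can be sketched on simulated data. Everything in this example is an assumption made for illustration: the 60% compliance rate, the true effect of 2.0, and the fact that compliers have a higher baseline outcome, which is what biases the naive treated-versus-untreated comparison.</p>

```python
import random

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def wald_estimate(assigned, treated, outcome):
    """Return (intent-to-treat effect, compliance rate, their ratio)."""
    itt = (mean(y for z, y in zip(assigned, outcome) if z == 1)
           - mean(y for z, y in zip(assigned, outcome) if z == 0))
    compliance = (mean(d for z, d in zip(assigned, treated) if z == 1)
                  - mean(d for z, d in zip(assigned, treated) if z == 0))
    return itt, compliance, itt / compliance

rng = random.Random(0)
TRUE_EFFECT = 2.0  # hypothetical effect of actually receiving treatment
assigned, treated, outcome = [], [], []
for _ in range(20000):
    complier = rng.random() < 0.6           # would take the treatment if assigned
    z = 1 if rng.random() < 0.5 else 0      # randomized assignment
    d = 1 if (z == 1 and complier) else 0   # treatment actually received
    baseline = 1.0 if complier else 0.0     # compliers differ at baseline
    assigned.append(z)
    treated.append(d)
    outcome.append(baseline + TRUE_EFFECT * d + rng.gauss(0, 1))

itt, compliance, iv_effect = wald_estimate(assigned, treated, outcome)
naive = (mean(y for d, y in zip(treated, outcome) if d == 1)
         - mean(y for d, y in zip(treated, outcome) if d == 0))

print(f"intent-to-treat effect: {itt:.2f}")
print(f"compliance rate:        {compliance:.2f}")
print(f"IV (ratio) estimate:    {iv_effect:.2f}")
print(f"naive comparison:       {naive:.2f}")
```

<p>In this simulation the ratio estimate lands near the assumed effect, whereas the naive comparison of treated and untreated people overstates it because those two groups differ at baseline.</p>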
<p>References:</p>
<ul>
<li>Chapters 12 and 13 of an <a href="http://pluto.huji.ac.il/%7Emsby/StatThink/index.html">Introduction to Statistical Thinking (With R, Without Calculus)</a></li>
<li><a href="http://journals.plos.org/plosmedicine/article/file?id=10.1371/journal.pmed.0020124&type=printable">Why Most Published Research Findings Are False</a></li>
<li><a href="http://www.thaddunning.com/wp-content/uploads/2009/12/Dunning_IEPS_InstrumentalVariables2.pdf">Instrumental Variables</a> by Thad Dunning (followup <a href="http://www.thaddunning.com/wp-content/uploads/2009/12/Dunning-PA.pdf">here</a>)</li>
<li>See Chapter 5 of <a href="http://www.cambridge.org/gb/academic/subjects/politics-international-relations/research-methods-politics/natural-experiments-social-sciences-design-based-approach">Natural Experiments in the Social Sciences</a> by Dunning for more detail</li>
<li><a href="http://students.brown.edu/seeing-theory/">Seeing Theory</a>, a visual, simulation-based tour of statistics</li>
<li>Felix Schönbrodt’s <a href="http://www.nicebread.de/whats-the-probability-that-a-significant-p-value-indicates-a-true-effect/">blog post</a> and
<a href="http://shinyapps.org/apps/PPV/">shiny app</a> on misconceptions about p-values and false discoveries</li>
<li><a href="http://www.cyclismo.org/tutorial/R/power.html">Calculating the power of a test</a></li>
<li><a href="http://science.sciencemag.org/content/sci/349/6251/aac4716.full.pdf">Estimating the reproducibility of psychological science</a> by Nosek et al.</li>
<li><a href="http://www.nature.com/nrn/journal/v14/n5/pdf/nrn3475.pdf">Power failure: why small sample size undermines the reliability of neuroscience</a> by Button et al.</li>
<li><a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=1850704">False-Positive Psychology</a> by Simmons, Nelson & Simonsohn</li>
<li>Science magazine’s announcement of <a href="http://www.sciencemag.org/careers/2015/12/register-your-study-new-publication-option">registered reports</a></li>
<li>Pre-registration portals from the <a href="https://osf.io/registries/">Open Science Framework</a>, <a href="https://cos.io/prereg/">Center for Open Science</a>, and <a href="https://aspredicted.org/index.php">AsPredicted.org</a></li>
<li><a href="http://www.pnas.org/content/111/24/8788.full.pdf">Experimental evidence of massive-scale emotional contagion through social networks</a> by Kramer, Guillory & Hancock</li>
<li><a href="http://www.stat.columbia.edu/~gelman/research/unpublished/p_hacking.pdf">The garden of forking paths</a> by Gelman & Loken</li>
<li><a href="http://www.nytimes.com/2015/04/07/upshot/alcoholics-anonymous-and-the-challenge-of-evidence-based-medicine.html">Instrumental variables</a> for clinical trials discussed in the New York Times</li>
<li><a href="https://www.nature.com/articles/ncomms14753">Exercise contagion in a global social network</a> by Aral & Nicolaides</li>
<li><a href="http://jakehofman.com/inprint/amazonrecs.pdf">Estimating the causal impact of recommendation systems from observational data</a> by Sharma, Hofman & Watts</li>
</ul>
Fri, 21 Apr 2017 10:00:00 +0000
Lecture 11: Causality & Experiments, Part 1
http://modelingsocialdata.org/lectures/2017/04/21/lecture-11-causality-and-experiments-1.html
<p>This was a joint guest lecture from <a href="http://www.andrewmao.net">Andrew Mao</a> and <a href="http://www.amitsharma.in">Amit Sharma</a> with an overview of causal inference and randomized experiments.</p>
<center>
<iframe src="//www.slideshare.net/slideshow/embed_code/key/J5DJRcIaKj5xU8" width="476" height="400" frameborder="0" marginwidth="0" marginheight="0" scrolling="no"></iframe>
</center>
<p>Most of what we’ve discussed in this class has focused on observational data—data obtained without direct intervention from or manipulation by those studying it.
We can learn a lot from observational data and use it to find interesting relationships, build predictive models, or even to generate hypotheses, but it has its limits.
This is often summarized by catchy phrases such as <a href="http://freakonomics.com/2009/06/30/so-long-and-thanks-for-all-the-f-tests/">“correlation is not causation”</a> or <a href="http://freakonomics.com/2009/06/30/so-long-and-thanks-for-all-the-f-tests/">“no causation without manipulation”</a>.</p>
<p>Amit opened this discussion by comparing two scenarios: (a) making a forecast about a static world with (b) trying to predict what happens when you change something in the world.
For the former you might do well by simply recognizing correlations (e.g., seeing my neighbor with an umbrella might predict rain), but the latter requires a more robust model of the world (e.g., handing my neighbor an umbrella is unlikely to cause rain).
We discussed the idea of trying to estimate the “effects of causes”, touching on the <a href="https://en.wikipedia.org/wiki/Rubin_causal_model">potential outcomes</a> and <a href="https://en.wikipedia.org/wiki/Causal_graph">causal graphical model</a> frameworks.</p>
<p>Using the effect of hospitalization on health as an example, we talked about confounding factors that complicate causal inference.
For instance, my health today might affect both whether I go to the hospital and my health tomorrow, making it difficult to isolate the effect of hospitalization on health from other factors.
We saw this mathematized in what Varian calls the “basic identity of causal inference”: observational estimates conflate the average treatment effect with selection bias, where selection bias measures the baseline difference between those who opted into treatment and those who didn’t.
Amit also discussed <a href="http://en.wikipedia.org/wiki/Simpson's_paradox">Simpson’s paradox</a>, where selection bias is so large that it leads to a directionally incorrect estimate of a causal effect: what appears to be a positive correlation without adjusting for possible confounds can in fact become a negative one when all available information is accounted for.</p>
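<p>A toy example may help show how the observed difference between groups decomposes into a treatment effect plus selection bias. The four people and their potential outcomes below are made up; in real data only one of the two potential outcomes is ever observed for each person, which is exactly why the decomposition cannot be computed directly. The effect here is the average effect on those treated, which equals the overall average effect in this toy data because every person's effect is the same.</p>

```python
def mean(values):
    values = list(values)
    return sum(values) / len(values)

# Hypothetical people: (went_to_hospital, health_if_untreated, health_if_treated).
# Hospitalization helps everyone by 1 unit, but sicker people are the ones who go.
people = [
    (1, 2.0, 3.0),
    (1, 3.0, 4.0),
    (0, 7.0, 8.0),
    (0, 8.0, 9.0),
]

treated = [p for p in people if p[0] == 1]
untreated = [p for p in people if p[0] == 0]

# What an observational comparison sees: treated outcomes vs. untreated outcomes.
observed_difference = mean(y1 for _, _, y1 in treated) - mean(y0 for _, y0, _ in untreated)

# What we would like to know: the average effect of treatment on those treated.
effect_on_treated = mean(y1 - y0 for _, y0, y1 in treated)

# Selection bias: the baseline gap between those who opted in and those who did not.
selection_bias = mean(y0 for _, y0, _ in treated) - mean(y0 for _, y0, _ in untreated)

print(f"observed difference: {observed_difference}")
print(f"effect on treated:   {effect_on_treated}")
print(f"selection bias:      {selection_bias}")
```

<p>Here the observed difference is negative even though treatment helps everyone, because the selection bias is larger in magnitude than the effect and of the opposite sign.</p>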
<p>Andrew then introduced counterfactuals and <a href="http://en.wikipedia.org/wiki/Randomized_experiment">randomized experiments</a>.
The question you’d really like to answer is this: if you cloned each person and sent one copy of that person to the hospital, but not the other, what would the resulting difference in health be?
Short of being able to do this, we could ask a slightly different question: if we had two groups of people who were nearly identical in every way and we sent one group to the hospital, but not the other, how would the health of the two groups differ?
This is precisely the idea behind randomized experiments, such as <a href="http://en.wikipedia.org/wiki/Clinical_trial">clinical trials</a> in medicine and <a href="http://en.wikipedia.org/wiki/A/B_testing">A/B testing</a> for online platforms.
Randomization is key here, as it provides a way of creating two groups that are as similar as possible prior to the treatment (e.g., hospitalization) being administered: if people are randomly assigned to groups, then there shouldn’t be any systematic difference between the two groups, eliminating selection bias.
Since the only difference between the groups is that one gets treated and the other doesn’t, we can ascribe differences in the outcome to the treatment.</p>
<p>While randomized experiments are the “gold standard” for causal inference, Andrew discussed some caveats and limitations in traditional approaches to experimentation in the social sciences, covering issues of both “internal” and “external” validity.
The first asks whether the experiment was properly designed to isolate the intended effect, whereas the second asks if we should expect the results of the study to generalize to other scenarios.
He proposed large-scale online experiments as a new paradigm that addresses some of these issues, and demonstrated the power of this approach with an in-class replication of his recent experiment showing how people learn to cooperate in the long-run even when it’s not in their interest to do so in the short term.</p>
<p>Amit closed the lecture by introducing natural experiments, where the idea is to exploit naturally occurring variation to tease out causal effects from observational data.
More on this next lecture.</p>
<p>References:</p>
<ul>
<li><a href="http://www.pnas.org/content/113/27/7310.full.pdf">Causal inference in economics and marketing</a> by Hal Varian</li>
<li>Chapter 21 of <a href="http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/">Advanced Data Analysis from an Elementary Point of View</a></li>
<li><a href="https://www.nature.com/articles/ncomms13800">Resilient cooperators stabilize long-run cooperation in the finitely repeated Prisoner’s Dilemma</a> by Mao, Dworkin, Suri & Watts</li>
<li><a href="http://turkserver.readthedocs.io/en/latest/">TurkServer</a>, the platform Andrew used for this experiment</li>
<li>Chapters 1 and 2 of <a href="http://isps.yale.edu/FEDAI">Field Experiments: Design, Analysis, and Interpretation</a></li>
<li>Matt Blackwell’s lecture notes on <a href="http://www.mattblackwell.org/files/teaching/s03-potential.pdf">causality and potential outcomes</a> as well as <a href="http://www.mattblackwell.org/files/teaching/s04-experiments.pdf">randomized experiments</a></li>
<li>Some notes on <a href="http://andrewgelman.com/2007/12/08/causal_inferenc_2/">causal inference</a> from Andrew Gelman</li>
<li>Why to <a href="http://www.vox.com/2015/3/23/8264355/research-study-hype">be skeptical</a> of most medical studies</li>
</ul>
Fri, 21 Apr 2017 10:00:00 +0000
Homework 3
http://modelingsocialdata.org/homework/2017/04/13/homework-3.html
<p>The third homework assignment, <a href="https://github.com/jhofman/msd2017/tree/master/homework/homework_3">posted on Github</a>, is due on Monday, April 24 by 11:59pm ET.</p>
<p>The first problem looks at logistic regression for text classification, the second explores the small-world phenomenon in “close” vs. “distant” friend networks, and the third studies how the structure of an email network changes as we remove weak ties from it.</p>
<p>Your code and a brief report with your results are to be submitted electronically in one zipped (or tarball-ed) file through the <a href="https://courseworks2.columbia.edu/courses/sis_course_id:APMAE4990_001_2017_1">CourseWorks</a> site.
All code should be contained in plain text files and should produce the exact results you provide in your writeup.
Code should be written in bash / R and should not have complex dependencies on non-standard libraries.
The report should simply present your answers to the questions in an organized format as either a plain text or pdf file.
All work should be your own and done individually.</p>
Thu, 13 Apr 2017 00:00:00 +0000
Lecture 10: Networks
http://modelingsocialdata.org/lectures/2017/04/07/lecture-10-networks.html
<p>We spent this lecture discussing network data, including a whirlwind tour of the history of network theory, representations and characteristics of networks, and algorithms for analyzing network data.</p>
<center>
<iframe src="//www.slideshare.net/slideshow/embed_code/key/4wtOi0tDYzVPPs" width="476" height="400" frameborder="0" marginwidth="0" marginheight="0" scrolling="no"></iframe>
</center>
<p>People have studied <em>theoretical</em> problems on and properties of graphs for a long time, but only in the last few decades have we had access to <em>real network data</em>, such as online social networks or the topology of the Internet.
When these data became available, it quickly became clear that real networks looked quite different than well-studied theoretical models (e.g., <a href="http://en.wikipedia.org/wiki/Erdős–Rényi_model">Erdős–Rényi</a> random graphs).
For example, many real networks have highly <a href="http://en.wikipedia.org/wiki/Complex_network#Scale-free_networks">skewed degree distributions</a>, reflecting the fact that most people in a social network have few friends while only a few people have many friends.
At the same time, social networks typically have <a href="http://en.wikipedia.org/wiki/Small-world_network">short path lengths</a>, in the sense that one needs only to traverse a handful of links to connect a randomly selected set of people in the network.</p>
<p>After discussing many different types of networks that we might analyze as well as the various levels of abstraction available for representing them, we turned to algorithms for efficiently computing shortest path lengths, connected components, mutual friends, and clustering coefficients.</p>
<p>We started with the problem of finding the shortest distance between a single source node and all other nodes in an (undirected, unweighted) network, as measured by the fewest number of edges you need to traverse to get from the source to every other node.
(Every researcher’s favorite version of this is computing their <a href="http://en.wikipedia.org/wiki/Erdős_number">Erdős number</a>, the academic take on the more well-known <a href="http://en.wikipedia.org/wiki/Six_Degrees_of_Kevin_Bacon">Kevin Bacon game</a>. Compute yours <a href="http://academic.research.microsoft.com">here</a>.)</p>
<p>Breadth first search (BFS) provides a nice solution.
The intuition behind BFS is simple: we start from the source node and mark it as distance zero from itself.
Then we visit each of its neighbors and mark those as distance one.
We repeat this iteratively, pushing forward a boundary of recently discovered nodes that are one additional hop from the source at each step.
BFS visits each node and edge in a network once, scaling linearly in the size of the network.
If, however, we would like to find the shortest distance between <em>all pairs</em> of nodes then we must repeat this for each possible source node, and so this quickly becomes prohibitively expensive for even moderately sized networks.
(See <a href="http://en.wikipedia.org/wiki/Shortest_path_problem#All-pairs_shortest_paths">here</a> for fancier, more efficient algorithms.)</p>
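<p>The BFS procedure described above can be sketched as follows. This is a minimal version that assumes the network is stored as a dictionary mapping each node to a list of its neighbors; the example graph and names are hypothetical, and the course notebooks use R rather than Python.</p>

```python
from collections import deque

def bfs_distances(graph, source):
    """Return the number of hops from `source` to every node reachable from it."""
    distances = {source: 0}
    frontier = deque([source])  # recently discovered nodes, in order of discovery
    while frontier:
        node = frontier.popleft()
        for neighbor in graph[node]:
            if neighbor not in distances:  # first visit gives the shortest distance
                distances[neighbor] = distances[node] + 1
                frontier.append(neighbor)
    return distances

# Hypothetical undirected network; "f" is isolated and so never reached from "a".
graph = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["a", "d"],
    "d": ["b", "c", "e"],
    "e": ["d"],
    "f": [],
}
distances = bfs_distances(graph, "a")
print(distances)
```

<p>Nodes that cannot be reached from the source simply never appear in the result, which is the property the connected components computation relies on.</p>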
<p>Next we looked at using BFS for a related problem: finding the number of <a href="http://en.wikipedia.org/wiki/Connected_component_(graph_theory)">connected components</a>, or separate pieces, of a network.
We did this by simply looping over our shortest path code, seeding it on each iteration with a currently unreachable node as the source until we reach all nodes.
We gave the reachable nodes in each BFS a unique label corresponding to its component.</p>
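<p>A minimal sketch of this labeling loop, again assuming an adjacency-list dictionary and using a made-up graph, might look like the following.</p>

```python
from collections import deque

def connected_components(graph):
    """Label every node with the index of the component it belongs to."""
    labels = {}
    next_label = 0
    for source in graph:
        if source in labels:
            continue  # already reached by an earlier search
        # Run a BFS from this unreached node, labeling everything it reaches.
        labels[source] = next_label
        frontier = deque([source])
        while frontier:
            node = frontier.popleft()
            for neighbor in graph[node]:
                if neighbor not in labels:
                    labels[neighbor] = next_label
                    frontier.append(neighbor)
        next_label += 1
    return labels

# Hypothetical network with three separate pieces.
graph = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4], 6: []}
labels = connected_components(graph)
num_components = len(set(labels.values()))
print(labels, num_components)
```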
<p>Then we moved on to computing the number of friends that any two nodes have in common, motivated by the problem of friend recommendations on social networks.
The underlying idea can be traced back to <a href="https://sociology.stanford.edu/sites/default/files/publications/the_strength_of_weak_ties_and_exch_w-gans.pdf">Granovetter</a>: two people are likely to know each other if they have many mutual friends.
To compute the number of mutual friends between all pairs of nodes, we exploit the fact that the neighbors of every node share that node as a common friend.
To count all mutual friends we simply loop over each node and increment a counter for every pair of its neighbors.
For each node this scales as the square of its degree, so the whole algorithm scales as the sum of the squared degrees of all nodes.
This can quickly become expensive if we have even a few high-degree nodes, which are quite common in practice.</p>
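<p>The counting loop described above can be sketched in a few lines. The graph is a hypothetical four-person network stored as an adjacency-list dictionary, and pairs are stored with their endpoints sorted so that each pair has a single key.</p>

```python
from collections import Counter
from itertools import combinations

def mutual_friend_counts(graph):
    """Count mutual friends for every pair of nodes that shares at least one."""
    counts = Counter()
    for node, neighbors in graph.items():
        # Every pair of this node's neighbors has this node as a common friend.
        for u, v in combinations(sorted(neighbors), 2):
            counts[(u, v)] += 1
    return counts

# Hypothetical undirected friendship network.
graph = {
    "a": ["b", "c", "d"],
    "b": ["a", "c", "d"],
    "c": ["a", "b"],
    "d": ["a", "b"],
}
counts = mutual_friend_counts(graph)
for pair, n in sorted(counts.items()):
    print(pair, n)
```

<p>In this example "c" and "d" are not directly connected but share two friends, which makes them a natural candidate for a recommendation. The inner loop runs once per pair of neighbors, which is where the squared-degree cost comes from.</p>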
<p>Finally, we looked at the closely related problem of counting the number of triangles around each node in a network.
This algorithm is nearly identical to computing mutual friends, as we generate the same set of two-hop paths through all pairs of a node’s neighbors, but simply increment different counters to generate different results.
Instead of accumulating mutual friends for each pair of a node’s neighbors, we ask whether every pair of neighbors are themselves directly connected.
If so, we count this as (half of) a triangle in which the node participates.
Dividing the number of closed triangles in a network by the number of possible triangles that could be present gives a useful measure of how <a href="http://en.wikipedia.org/wiki/Clustering_coefficient">clustered</a> a network is.</p>
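<p>Here is a minimal sketch of the triangle count and the resulting clustering measure, using the same kind of adjacency-list dictionary and a hypothetical graph. One difference from the description above: this version loops over unordered pairs of neighbors, so each connected pair counts as a whole triangle for the node rather than half of one.</p>

```python
from itertools import combinations

def triangles_per_node(graph):
    """Count the triangles that each node participates in."""
    neighbor_sets = {node: set(neighbors) for node, neighbors in graph.items()}
    return {
        node: sum(1 for u, v in combinations(neighbors, 2) if v in neighbor_sets[u])
        for node, neighbors in neighbor_sets.items()
    }

def possible_triangles_per_node(graph):
    """Count pairs of neighbors around each node, i.e. triangles that could exist."""
    return {node: len(neighbors) * (len(neighbors) - 1) // 2
            for node, neighbors in graph.items()}

# Hypothetical undirected network containing two triangles: a-b-c and a-b-d.
graph = {
    "a": ["b", "c", "d"],
    "b": ["a", "c", "d"],
    "c": ["a", "b"],
    "d": ["a", "b"],
}
triangles = triangles_per_node(graph)
possible = possible_triangles_per_node(graph)

# Fraction of neighbor pairs that are themselves connected, summed over all nodes.
clustering = sum(triangles.values()) / sum(possible.values())
print(triangles, clustering)
```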
<p>To better understand properties of networks and how to compute them, we looked at a few example networks in R using the <code class="highlighter-rouge">igraph</code> package.
See the <a href="https://github.com/jhofman/msd2017/tree/master/lectures/lecture_10">notebooks</a> on the course GitHub page for related code and data used in the lectures.</p>
<p>References:</p>
<ul>
<li>Chapters 2, 18, and 20 of Easley and Kleinberg’s <a href="http://www.cs.cornell.edu/home/kleinber/networks-book/">Networks, Crowds, and Markets</a></li>
<li>Granovetter’s <a href="https://sociology.stanford.edu/sites/default/files/publications/the_strength_of_weak_ties_and_exch_w-gans.pdf">Strength of Weak Ties</a> paper</li>
<li>de Solla Price on <a href="http://garfield.library.upenn.edu/papers/pricenetworks1965.pdf">citation networks</a> and <a href="http://garfield.library.upenn.edu/price/pricetheory1976.pdf">cumulative advantage</a></li>
<li><a href="https://www.math.cornell.edu/m/sites/default/files/imported/People/strogatz/nature_smallworld.pdf">Collective dynamics of ‘small-world’ networks</a> by Watts & Strogatz</li>
<li><a href="http://web.stanford.edu/~jugander/papers/websci12-fourdegrees.pdf">Four degrees of separation</a>: scaling up calculations to the entire Facebook social graph</li>
<li><a href="http://www.rebennack.net/SEA2011/files/talks/SEA2011_Pajor.pdf">Customizable route planning</a>: how shortest path calculations are done in modern mapping applications</li>
<li>These <a href="https://berkeleydatascience.files.wordpress.com/2012/03/20120320berkeley.pdf">slides</a> on the early system for friend recommendation on Facebook (pages 28 to 37)</li>
</ul>
<!--
BFS computes shortest path: http://www.cs.toronto.edu/~krueger/cscB63h/lectures/BFS.pdf
BFS runtime and correctness: http://www.cse.ust.hk/faculty/golin/COMP271Sp03/Notes/MyL06.ps
[MapReduce for networks](http://jakehofman.com/icwsm2010/slides.html)
https://github.com/jhofman/icwsm2010_tutorial
[Curse of the last reducer](http://theory.stanford.edu/~sergei/papers/www11-triangles.pdf)
[Model of MapReduce](http://theory.stanford.edu/~sergei/papers/soda10-mrc.pdf)
[Facebook at scale](http://arxiv.org/abs/1111.4503)
-->
Fri, 07 Apr 2017 10:00:00 +0000
Lectures 8 & 9: Classification
http://modelingsocialdata.org/lectures/2017/03/24/lectures-8-9-classification.html
<p>This post covers two lectures on classification, the first a guest lecture from <a href="http://www.columbia.edu/~chw2/">Chris Wiggins</a>.</p>
<center>
<iframe src="//www.slideshare.net/slideshow/embed_code/key/n0IHNlWKh5z0Di" width="476" height="400" frameborder="0" marginwidth="0" marginheight="0" scrolling="no"></iframe>
</center>
<p>Chris opened his guest lecture by introducing the problem of classification, where the outcome is categorical (e.g., whether an email is spam or <a href="https://wiki.apache.org/spamassassin/Ham">ham</a>) rather than continuous.
We first reviewed <a href="http://en.wikipedia.org/wiki/Bayes'_rule">Bayes’ rule</a> for inverting conditional probabilities via a simple, but <a href="http://bit.ly/ggbbc">somewhat counterintuitive</a>, <a href="http://www.scientificamerican.com/article/what-is-bayess-theorem-an/">medical diagnosis example</a> and then adapted this to an (extremely naive) <a href="https://github.com/jhofman/msd2017/blob/master/lectures/lecture_8/enron_naive_bayes.sh">one-word spam classifier</a>.
We improved upon this by considering all words present in a document and arrived at naive Bayes—a simple linear method for classification in which we model each word occurrence independently and use Bayes’ rule to calculate the probability the document belongs to each class given the words it contains.
Chris concluded with a unifying overview of various loss functions and derived <a href="https://en.wikipedia.org/wiki/Boosting_%28machine_learning%29">boosting</a> under exponential loss.</p>
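<p>To illustrate why naive Bayes is a linear method, the sketch below scores a document by adding up one log-ratio per vocabulary word on top of the prior log-odds. It assumes a model in which each word is either present or absent in a document; the three-word vocabulary and all probabilities are invented for the example and are not estimates from the Enron data used in class.</p>

```python
import math

def spam_log_odds(words, prior_spam, p_word_given_spam, p_word_given_ham):
    """Log-odds that a document is spam under a word-presence naive Bayes model."""
    log_odds = math.log(prior_spam / (1 - prior_spam))
    present = set(words)
    for word, p_spam in p_word_given_spam.items():
        p_ham = p_word_given_ham[word]
        if word in present:
            log_odds += math.log(p_spam / p_ham)
        else:
            log_odds += math.log((1 - p_spam) / (1 - p_ham))
    return log_odds

# Hypothetical probabilities that a spam or ham email contains each word.
p_word_given_spam = {"money": 0.40, "meeting": 0.05, "free": 0.50}
p_word_given_ham = {"money": 0.05, "meeting": 0.30, "free": 0.10}

spam_like = spam_log_odds(["free", "money"], 0.3, p_word_given_spam, p_word_given_ham)
ham_like = spam_log_odds(["meeting"], 0.3, p_word_given_spam, p_word_given_ham)

for name, score in [("free money", spam_like), ("meeting", ham_like)]:
    probability = 1 / (1 + math.exp(-score))
    print(f"{name!r}: log-odds = {score:.2f}, P(spam) = {probability:.3f}")
```

<p>Because each word contributes a fixed amount depending only on whether it is present, the score is a weighted sum of word indicators plus a constant.</p>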
<p>Although naive Bayes makes an obviously incorrect assumption that all features are independent, it turns out to be a reasonably useful method in practice.
It’s simple and scalable to train, easy to update as new data arrive, easy to interpret, and often more competitive in performance than one might expect.
That said, there are some obvious issues with naive Bayes as presented, namely overfitting in the training process and overconfidence / miscalibration when making predictions.</p>
<p>The first issue arises when thinking about how to estimate word probabilities.
Simple maximum likelihood estimates (MLE) for word probabilities lead to overfitting, implying, for instance, that it’s impossible to see a word in a given class in the future if we’ve never seen it occur in that class in the past.
We dealt with this by thinking about maximum a posteriori (MAP) estimation which led to the idea of <a href="https://en.wikipedia.org/wiki/Additive_smoothing">Laplace smoothing</a>, or adding <a href="http://en.wikipedia.org/wiki/Pseudocount">pseudocounts</a> to empirical word counts to prevent overfitting.
As usual, determining the amount of smoothing to use is an empirical question, often solved by methods such as cross-validation.</p>
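<p>The effect of pseudocounts can be seen in a small calculation. This sketch assumes the word-presence model, where the probability of a word in a class is estimated from the number of documents in that class containing it; the function name and the counts are hypothetical.</p>

```python
def word_probability(docs_with_word, num_docs, pseudocount=0.0):
    """Estimate P(word present | class) from document counts.

    With pseudocount = 0 this is the maximum likelihood estimate. A positive
    pseudocount adds that many imaginary documents with the word and the same
    number without it, pulling the estimate away from 0 and 1.
    """
    return (docs_with_word + pseudocount) / (num_docs + 2 * pseudocount)

# Hypothetical counts: "meeting" appears in 0 of 50 spam emails seen so far.
mle = word_probability(docs_with_word=0, num_docs=50)
smoothed = word_probability(docs_with_word=0, num_docs=50, pseudocount=1.0)

print(f"maximum likelihood estimate: {mle}")
print(f"with one pseudocount:        {smoothed:.4f}")
```

<p>The unsmoothed estimate of zero would force the classifier to rule out spam for any future email containing the word, no matter what else the email says; the smoothed estimate leaves a small nonzero probability instead.</p>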
<p>As for the second issue, overconfident predictions that stem from wrongly assuming feature independence, we addressed this by abandoning naive Bayes in favor of logistic regression.
Logistic regression makes predictions using the same functional form as naive Bayes—the log-odds are modeled as a weighted combination of feature values—but fits these weights in a manner that accounts for correlations between features.
We (once again) applied the maximum likelihood principle to arrive at criteria for estimating these weights, and discussed gradient descent and <a href="http://en.wikipedia.org/wiki/Newton's_method">Newton’s method</a> for solutions.
The resulting algorithms are very close in spirit to those for linear regression, but slightly more complex due to the logistic function.
And, similar to linear regression, we discussed the idea of regularizing logistic regression by including a term in the loss function to penalize large weight vectors.</p>
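<p>As a rough sketch of fitting regularized logistic regression by gradient descent, the example below minimizes the average log loss plus a squared penalty on the weights. The learning rate, penalty strength, number of steps, and the tiny dataset are all arbitrary choices for illustration, and for simplicity the penalty is applied to the intercept as well, which one would often avoid in practice.</p>

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic_regression(features, labels, learning_rate=0.5, penalty=0.01, num_steps=2000):
    """Fit weights by gradient descent on average log loss plus an L2 penalty."""
    n = len(features)
    weights = [0.0] * len(features[0])
    for _ in range(num_steps):
        # Gradient of the penalty term (penalty / 2) * sum of squared weights.
        gradient = [penalty * w for w in weights]
        for x, y in zip(features, labels):
            predicted = sigmoid(sum(w * xi for w, xi in zip(weights, x)))
            for j, xi in enumerate(x):
                gradient[j] += (predicted - y) * xi / n
        weights = [w - learning_rate * g for w, g in zip(weights, gradient)]
    return weights

# Hypothetical one-feature dataset; the leading 1.0 in each row is the intercept term.
xs = [-3.0, -2.0, -1.0, -0.5, 0.5, 1.0, 2.0, 3.0]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
features = [[1.0, x] for x in xs]

weights = fit_logistic_regression(features, labels)
print(f"intercept = {weights[0]:.3f}, slope = {weights[1]:.3f}")
print(f"P(y = 1 | x = 3) = {sigmoid(weights[0] + 3 * weights[1]):.3f}")
```

<p>The update has the same shape as the one for linear regression, a sum of prediction errors times feature values, with the logistic function applied to the linear score before the error is computed.</p>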
<p>We concluded with a discussion of several metrics for evaluating classifiers, including calibration, <a href="https://en.wikipedia.org/wiki/Confusion_matrix">confusion matrices</a>, accuracy, precision and recall, and the <a href="https://en.wikipedia.org/wiki/Receiver_operating_characteristic">ROC curve</a>.</p>
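<p>Several of these metrics follow directly from the four cells of a binary confusion matrix, as in the minimal sketch below. The labels and predictions are made up, with 1 standing for the positive class (e.g., spam).</p>

```python
def confusion_counts(actual, predicted):
    """Return (true positives, false positives, false negatives, true negatives)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    return tp, fp, fn, tn

# Hypothetical true labels and classifier predictions for ten emails.
actual = [1, 1, 1, 0, 0, 0, 0, 1, 0, 0]
predicted = [1, 0, 1, 0, 0, 1, 0, 1, 0, 0]

tp, fp, fn, tn = confusion_counts(actual, predicted)
accuracy = (tp + tn) / len(actual)  # fraction of all predictions that are correct
precision = tp / (tp + fp)          # fraction of predicted positives that are real
recall = tp / (tp + fn)             # fraction of real positives that are found

print(f"tp={tp} fp={fp} fn={fn} tn={tn}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```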
<p>A few references:</p>
<ul>
<li>Chapter 12 of <a href="http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/">Advanced Data Analysis from an Elementary Point of View</a></li>
<li>Chapter 4 of <a href="http://www-bcf.usc.edu/~gareth/ISL/getbook.html">An Introduction to Statistical Learning</a></li>
<li><a href="http://www.cs.iastate.edu/~honavar/bayes-lewis.pdf">Naive Bayes at 40</a> by Lewis (1998)</li>
<li><a href="http://www.jstor.org/pss/1403452">Idiot’s Bayes—Not So Stupid After All?</a> by Hand and Yu (2001)</li>
<li><a href="http://robotics.stanford.edu/users/sahami/papers-dir/spam.pdf">A Bayesian Approach to Filtering Junk E-mail</a> from Sahami, Dumais, Heckerman, and Horvitz (1998)</li>
<li><a href="http://www.paulgraham.com/spam.html">A Plan for Spam</a> by Paul Graham (2002)</li>
<li><a href="https://ccrma.stanford.edu/workshops/mir2009/references/ROCintro.pdf">An introduction to ROC analysis</a></li>
</ul>
Fri, 24 Mar 2017 10:00:00 +0000
Homework 2
http://modelingsocialdata.org/homework/2017/03/14/homework-2.html
<p>The second homework assignment, <a href="https://github.com/jhofman/msd2017/tree/master/homework/homework_2">posted on Github</a>, is due on Monday, March 27 by 11:59pm ET.</p>
<p>The first problem explores various modeling scenarios, the second looks at cross-validation for polynomial regression, and the third involves fitting and interpreting a model of supermarket sales data.
Details are in the README.md file for each problem.</p>
<p>Your code and a brief report with your results are to be submitted electronically in one zipped (or tarball-ed) file through the <a href="https://courseworks2.columbia.edu/courses/sis_course_id:APMAE4990_001_2017_1">CourseWorks</a> site.
All code should be contained in plain text files and should produce the exact results you provide in your writeup.
Code should be written in bash / R and should not have complex dependencies on non-standard libraries.
The report should simply present your answers to the questions in an organized format as either a plain text or pdf file.
All work should be your own and done individually.</p>
Tue, 14 Mar 2017 20:00:00 +0000
Lecture 7: Regression, Part 2
http://modelingsocialdata.org/lectures/2017/03/03/lecture-7-regression-2.html
<p>This was the second lecture on the theory and practice of regression, focused on model complexity and generalization.</p>
<center>
<iframe src="//www.slideshare.net/slideshow/embed_code/key/AO2fqTF50kBrOb" width="476" height="400" frameborder="0" marginwidth="0" marginheight="0" scrolling="no"></iframe>
</center>
<p>We started by revisiting the pageview prediction problem from last lecture.
Last time we worked on constructing a model that captured some of the trends in typical browsing activity as a function of gender and age.
We saw that including quadratic terms for age and interacting this with gender gave a reasonable model, at least in terms of visually matching empirical aggregates.
This time we talked about two high-level points: first, quantifying model fit, and second, knowing when to stop fitting.
In the setting above, this translates to asking “how good is a quadratic fit” and “why shouldn’t I use a cubic, or quartic, etc.?”</p>
<p>To the first point, we discussed <a href="https://en.wikipedia.org/wiki/Root-mean-square_deviation">root mean squared error (RMSE)</a> and the <a href="https://en.wikipedia.org/wiki/Coefficient_of_determination">coefficient of determination (<script type="math/tex">R^2</script>)</a> as sensible metrics of model fit.
RMSE is just the square root of the mean of the squared loss we discussed last time, where the square root adjusts the units to match those of the outcome we’re trying to predict.
It’s useful when we already have a sense of absolute scale for “what’s good”.
The coefficient of determination, on the other hand, captures the fraction of variance in outcomes explained by the model, and is useful when we don’t have such a scale or are comparing across different problems.
We showed that this is the same as comparing the mean squared error (MSE) of the model to the MSE of a simple baseline where we always predict the average outcome.
Finally, we discussed the connection between <a href="https://en.wikipedia.org/wiki/Pearson_correlation_coefficient">Pearson’s correlation coefficient</a> and <script type="math/tex">R^2</script>.
See <a href="https://economictheoryblog.com/2014/11/05/proof/">here</a> for a proof that, for simple linear regression, the latter is in fact the square of the former.</p>
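<p>To make the two metrics concrete, here is a minimal sketch in plain Python (the course code itself is in R).
The function names and the toy numbers are made up for illustration; the second function follows the "compare to a mean-only baseline" view of the coefficient of determination described above.</p>

```python
import math

def rmse(y, y_hat):
    """Root mean squared error, in the same units as the outcome."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))

def r_squared(y, y_hat):
    """1 - MSE(model) / MSE(baseline that always predicts the mean outcome)."""
    mean_y = sum(y) / len(y)
    mse_model = sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)
    mse_baseline = sum((a - mean_y) ** 2 for a in y) / len(y)
    return 1 - mse_model / mse_baseline

# Toy outcomes and predictions (illustrative values, not course data)
y = [1.0, 2.0, 3.0, 4.0]
y_hat = [1.5, 1.5, 3.5, 3.5]
print(rmse(y, y_hat), r_squared(y, y_hat))
```

<p>RMSE answers "how far off are we, in units of the outcome?", while the second number answers "how much better are we than always guessing the average?"</p>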
<p>Applying both of these metrics to the pageview dataset, we saw that while there were systematic trends in typical viewing behavior by age and gender, there was still a surprisingly large amount of variation in individual activity for people of the same age and gender.</p>
<p>This led us to our second high-level topic, the question of complexity control: How complicated should we make our model?
We discussed the idea of generalization error, and how we’d like models that are both complex enough to account for the past and simple enough to predict the future.
Cross-validation is the most common approach to navigating this tradeoff, where we divide our data into a training set for fitting models, a validation set for comparing these different fits, and a test set that’s used once (and <em>only once</em>) to quote the expected future performance of the model we end up selecting.
We talked about <a href="https://www.youtube.com/watch?v=TIgfjmp-4BA">k-fold cross-validation</a> as a more statistically robust version of estimating generalization error.</p>
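<p>As a rough illustration of k-fold cross-validation, the sketch below compares two candidate models on synthetic data using only the Python standard library.
Everything here (the helper names, the data-generating process, the choice of five folds) is an assumption made for the example, not the code used in lecture.</p>

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_mean(xs, ys):
    """Baseline model: always predict the average training outcome."""
    m = sum(ys) / len(ys)
    return lambda x: m

def fit_line(xs, ys):
    """Simple linear regression via the closed-form least squares solution."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    return lambda x: intercept + slope * x

def cv_mse(fit, xs, ys, k=5, seed=0):
    """Average held-out mean squared error across k folds."""
    fold_mses = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        errs = [(ys[i] - model(xs[i])) ** 2 for i in held_out]
        fold_mses.append(sum(errs) / len(errs))
    return sum(fold_mses) / k

# Synthetic data: a noisy line (made up for illustration)
rng = random.Random(1)
xs = [i / 10 for i in range(50)]
ys = [2 * x + 1 + rng.gauss(0, 0.5) for x in xs]
print("mean-only:", cv_mse(fit_mean, xs, ys))
print("line:     ", cv_mse(fit_line, xs, ys))
```

<p>Each observation is held out exactly once, so every data point contributes to the estimate of generalization error without ever being scored by a model that saw it during fitting.</p>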
<p>We also phrased this issue in terms of the <a href="http://scott.fortmann-roe.com/docs/BiasVariance.html">bias-variance tradeoff</a>.
Simple models are likely biased in that they systematically misrepresent the world, and would do so even with an infinite amount of data.
At the same time, estimating a simple model is a low variance procedure in that our results don’t change substantially when we fit it on different samples of data.
More flexible models, on the other hand, have little bias and can capture more complex patterns in the world.
The downside is that this flexibility also renders such models sensitive to noise, often leading to high variance, or drastically different results with different samples of the data.</p>
<p>We concluded lecture with a brief discussion of <a href="https://en.wikipedia.org/wiki/Regularization_%28mathematics%29">regularization</a> as a way of modifying loss functions to improve the generalization error of our models by explicitly balancing the fit to the training data with the “complexity” of the model.
The idea is that introducing some bias in our models is sometimes a good idea if the corresponding reduction in variance is enough to lower the mean squared error.</p>
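<p>One simple instance of this idea is ridge regression, which adds a penalty on the squared size of the coefficients to the squared loss.
The sketch below assumes a single feature and no intercept, so the minimizer has a one-line closed form; the function name and toy data are hypothetical.</p>

```python
def ridge_slope(xs, ys, lam):
    """Minimize sum((y - w*x)^2) + lam * w^2 over w for a one-feature,
    no-intercept model. Setting the derivative to zero gives
    w = sum(x*y) / (sum(x^2) + lam)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

# Toy data lying exactly on y = 2x (illustrative only)
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
for lam in [0.0, 1.0, 14.0, 100.0]:
    print(lam, ridge_slope(xs, ys, lam))
```

<p>With no penalty we recover the least squares slope; as the penalty grows, the estimate shrinks toward zero, trading a bit of bias for lower variance across samples.</p>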
<p>Code from the lecture is up
<a href="https://github.com/jhofman/msd2017/blob/master/lectures/lecture_7/">on Github</a>.
Also see this interactive Shiny App to explore <a href="https://jmhmsr.shinyapps.io/regularization/">regularization</a>.</p>
<p>References:</p>
<ul>
<li>Chapter 2 of <a href="http://www-bcf.usc.edu/~gareth/ISL/">An Introduction to Statistical Learning</a> on the bias-variance tradeoff</li>
<li>Section 1.4 of <a href="http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/">Advanced Data Analysis from an Elementary Point of View</a> on the same, with a more detailed derivation
</li>
<li>Chapter 5 of <a href="http://www-bcf.usc.edu/~gareth/ISL/">An Introduction to Statistical Learning</a> and Chapter 3 of <a href="http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/">Advanced Data Analysis from an Elementary Point of View</a> on resampling and cross-validation</li>
<li>Recent work on using differentially private mechanisms for <a href="https://research.googleblog.com/2015/08/the-reusable-holdout-preserving.html">reusing holdout sets</a></li>
</ul>
Fri, 03 Mar 2017 10:00:00 +0000Lecture 6: Regression, Part 1
http://modelingsocialdata.org/lectures/2017/02/24/lecture-6-regression-1.html
http://modelingsocialdata.org/lectures/2017/02/24/lecture-6-regression-1.html<p>This was the first of two lectures on the theory and practice of regression.</p>
<center>
<iframe src="//www.slideshare.net/slideshow/embed_code/key/M3UPic6Yfewant" width="476" height="400" frameborder="0" marginwidth="0" marginheight="0" scrolling="no"></iframe>
</center>
<p>We started with a high-level overview of regression, which can be broadly defined as any analysis of how one continuous variable (the “outcome”) changes with others (the “inputs”, “predictors”, or “features”).
The goals of a regression analysis can vary, from describing the data at hand, to predicting new outcomes, to explaining the associations between outcomes and predictors.
This includes everything from looking at histograms and scatter plots to building statistical models.</p>
<p>We focused on the latter and discussed ordinary least squares regression.
First, we motivated this as an optimization problem and then connected squared loss minimization to the more general principle of maximum likelihood.
Then we discussed several ways to solve this optimization problem to estimate coefficients for a linear model, which are summarized in the table below.</p>
<table>
<thead>
<tr>
<th>Method</th>
<th style="text-align: center">Space</th>
<th style="text-align: center">Time</th>
<th>Comments</th>
</tr>
</thead>
<tbody>
<tr>
<td>Invert normal equations</td>
<td style="text-align: center"><script type="math/tex">N K + K^2</script></td>
<td style="text-align: center"><script type="math/tex">K^3</script></td>
<td>Good for medium-sized datasets with a relatively small number (e.g., hundreds or thousands) of features</td>
</tr>
<tr>
<td>Gradient descent</td>
<td style="text-align: center"><script type="math/tex">N K</script></td>
<td style="text-align: center"><script type="math/tex">NK</script> per step</td>
<td>Good for larger datasets that still fit in memory but have more (e.g., millions) features; requires tuning learning rate</td>
</tr>
<tr>
<td>Stochastic gradient descent</td>
<td style="text-align: center"><script type="math/tex">K</script></td>
<td style="text-align: center"><script type="math/tex">K</script> per step</td>
<td>Good for datasets that exceed available memory; more sensitive to learning rate schedule</td>
</tr>
</tbody>
</table>
<p>See also this interactive Shiny App to explore <a href="https://jmhmsr.shinyapps.io/modelfit/">manually fitting a simple model</a> and this notebook by Jongbin Jung with <a href="http://jakehofman.com/gd/">an animation of gradient descent</a>.</p>
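<p>For a feel of how the two iterative methods in the table differ, here is a small pure-Python sketch of batch and stochastic gradient descent for squared loss.
The learning rates, step counts, and noise-free toy data are assumptions chosen so the example converges quickly; real problems need these tuned.</p>

```python
import random

def predict(w, x):
    """Linear prediction: dot product of weights and features."""
    return sum(wj * xj for wj, xj in zip(w, x))

def gradient_descent(X, y, lr=0.3, steps=500):
    """Batch gradient descent on mean squared error: each step touches all N rows."""
    n, k = len(X), len(X[0])
    w = [0.0] * k
    for _ in range(steps):
        grad = [0.0] * k
        for xi, yi in zip(X, y):
            err = predict(w, xi) - yi
            for j in range(k):
                grad[j] += 2 * err * xi[j] / n
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w

def sgd(X, y, lr=0.1, epochs=500, seed=0):
    """Stochastic gradient descent: each step uses a single row."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            err = predict(w, X[i]) - y[i]
            w = [wj - lr * 2 * err * xj for wj, xj in zip(w, X[i])]
    return w

# Toy data on the line y = 1 + 2x; the first column of X is the intercept term
xs = [0.0, 0.25, 0.5, 0.75, 1.0]
X = [[1.0, x] for x in xs]
y = [1 + 2 * x for x in xs]
print(gradient_descent(X, y))
print(sgd(X, y))
```

<p>Both should land close to an intercept of 1 and a slope of 2 here. The batch version needs the whole dataset for every step, whereas the stochastic version only ever holds one row and the weights, which is why it suits data that does not fit in memory.</p>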
<p>In the second half of class we looked at fitting linear models in R, with an application to understanding how internet browsing activity varies by age and gender.
See the <a href="https://github.com/jhofman/msd2017/blob/master/lectures/lecture_6/linear_models.ipynb">Jupyter notebook</a> up on Github for more details.
The main lesson here is that there’s more to modeling than just optimization, with many important steps along the way that range from collecting and specifying outcomes and predictors, to determining the form of a model, to assessing performance and interpreting results.</p>
<p>References:</p>
<ul>
<li>Chapters <a href="http://r4ds.had.co.nz/model-basics.html">23</a> and <a href="http://r4ds.had.co.nz/model-building.html">24</a> of <a href="http://r4ds.had.co.nz">R for Data Science</a></li>
<li>Chapter 3 of <a href="http://www-bcf.usc.edu/~gareth/ISL/">An Introduction to Statistical Learning</a></li>
<li>Chapters 1 and 2 of <a href="http://www.stat.cmu.edu/%7Ecshalizi/ADAfaEPoV/">Advanced Data Analysis from an Elementary Point of View</a></li>
<li><a href="http://www.cambridge.org/us/academic/subjects/statistics-probability/statistical-theory-and-methods/statistical-models-theory-and-practice-2nd-edition?format=PB">Statistical Models</a> by David Freedman</li>
<li><a href="https://us.sagepub.com/en-us/nam/regression-analysis/book226138">Regression Analysis</a> by Richard Berk</li>
</ul>
Fri, 24 Feb 2017 10:00:00 +0000Lecture 5: Data Visualization
http://modelingsocialdata.org/lectures/2017/02/17/lecture-5-data-visualization.html
http://modelingsocialdata.org/lectures/2017/02/17/lecture-5-data-visualization.html<p>We had a guest lecture from <a href="http://hci.stanford.edu/~cagatay//">Çağatay Demiralp</a> on data visualization.</p>
<center>
<iframe src="https://docs.google.com/viewer?srcid=0B-M9UEiE6KFAWmtvUjQta0RFNkk&pid=explorer&efh=false&a=v&chrome=false&embedded=true" width="476px" height="400px" frameborder="0" marginwidth="0" marginheight="0"></iframe>
</center>
<p>Çağatay discussed both the principles and practice of data visualization, starting with historical examples of <a href="https://en.wikipedia.org/wiki/1854_Broad_Street_cholera_outbreak">John Snow’s visualization</a> of cholera outbreaks and <a href="https://en.wikipedia.org/wiki/Florence_Nightingale#/media/File:Nightingale-mortality.jpg">Florence Nightingale’s infographic</a> on causes of death in the army.
He emphasized Stuart Card’s point that visualizations represent data in a way that <a href="https://books.google.com/books?id=wdh2gqWfQmgC&lpg=PP1&dq=Readings%20in%20Information%20Visualization%3A%20Using%20Vision%20to%20Think&pg=PA15#v=onepage&q=amplify%20cognition&f=false">amplifies cognition</a>, making it easier to see patterns in data, a point nicely illustrated by <a href="https://en.wikipedia.org/wiki/Anscombe%27s_quartet">Anscombe’s Quartet</a>.</p>
<p>We discussed the perceptual aspects of visualizations, including <a href="https://en.wikipedia.org/wiki/Stevens%27_power_law">Stevens’ Power Law</a>, and experiments by <a href="http://www.jstor.org/stable/2288400?seq=1#page_scan_tab_contents">Cleveland and McGill</a> showing that not all visual encodings are created equal, and that the best encoding depends on the type of data being visualized.
He closed with a discussion of different data visualization tools, including <a href="http://dl.acm.org/citation.cfm?id=22950">Mackinlay’s</a> expressiveness / effectiveness tradeoff and <a href="https://en.wikipedia.org/wiki/Leland_Wilkinson">Wilkinson’s</a> grammar of graphics.</p>
<p>In the second part of class we looked at <code class="highlighter-rouge">ggplot2</code>, Hadley Wickham’s popular implementation of Wilkinson’s grammar of graphics.
We focused on using <code class="highlighter-rouge">ggplot2</code> to effectively communicate information through visualizations.
Every visualization should convey a point, preferably one that can be summarized in a short sentence.
This <a href="https://github.com/jhofman/msd2017/blob/master/lectures/lecture_5/visualization_with_ggplot2.ipynb">Jupyter notebook</a> provides an intro to <code class="highlighter-rouge">ggplot2</code>, detailing how the choices we make in the visualization process affect the messages our plots and figures convey.</p>
<p>Readings and references:</p>
<ul>
<li>Chapters <a href="http://r4ds.had.co.nz/data-visualisation.html">3</a>, <a href="http://r4ds.had.co.nz/exploratory-data-analysis.html">7</a>, and <a href="http://r4ds.had.co.nz/graphics-for-communication.html">28</a> in <a href="http://r4ds.had.co.nz/">R for Data Science</a></li>
<li>RStudio’s <a href="https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf">ggplot2 cheatsheet</a></li>
<li>DataCamp’s <a href="https://campus.datacamp.com/courses/data-visualization-with-ggplot2-1/">Data Visualization with ggplot2</a> tutorial</li>
<li>Videos on <a href="http://varianceexplained.org/RData/lessons/lesson2/">Visualizing Data with ggplot2</a></li>
<li>Sean Anderson’s <a href="http://seananderson.ca/courses/12-ggplot2/ggplot2_slides_with_examples.pdf">ggplot2 slides</a> (<a href="http://github.com/seananderson/datawranglR">code</a>) for more examples</li>
<li>The <a href="http://docs.ggplot2.org/current/">official ggplot2 docs</a></li>
</ul>
Fri, 17 Feb 2017 10:10:00 +0000Lecture 4: Counting at Scale
http://modelingsocialdata.org/lectures/2017/02/10/lecture-4-counting-at-scale.html
http://modelingsocialdata.org/lectures/2017/02/10/lecture-4-counting-at-scale.html<p>In this lecture we discussed combining and reshaping data in R as well as counting at scale with MapReduce.</p>
<p>First we extended last week’s discussion of data manipulation in R by looking at the various joins available in <code class="highlighter-rouge">dplyr</code> (inner, left, full, and anti) for combining different tables.
Then we used the <code class="highlighter-rouge">tidyr</code> package to reshape data that comes in inconvenient formats (e.g., from long to wide with <code class="highlighter-rouge">spread</code>, or vice versa with <code class="highlighter-rouge">gather</code>).</p>
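<p>To see what distinguishes the four joins, here is a toy sketch on lists of Python dictionaries.
It is not how <code class="highlighter-rouge">dplyr</code> works internally: the <code class="highlighter-rouge">join</code> helper is hypothetical, assumes the key is unique in the right-hand table, and simply omits missing columns rather than filling them with NA.</p>

```python
def join(left, right, key, how="inner"):
    """Join two lists of dicts on `key` (assumed unique in `right`).
    Unmatched rows keep only their own columns (no NA filling)."""
    lookup = {row[key]: row for row in right}
    out = []
    for row in left:
        match = lookup.get(row[key])
        if how == "anti":
            if match is None:            # keep left rows with no match
                out.append(dict(row))
        elif match is not None:          # matched rows appear in inner, left, full
            out.append({**row, **match})
        elif how in ("left", "full"):    # unmatched left rows are kept
            out.append(dict(row))
    if how == "full":                    # full also keeps unmatched right rows
        left_keys = {row[key] for row in left}
        out += [dict(row) for row in right if row[key] not in left_keys]
    return out

# Made-up tables for illustration
users = [{"id": 1, "name": "ann"}, {"id": 2, "name": "bob"}]
views = [{"id": 2, "pages": 10}, {"id": 3, "pages": 7}]
for how in ["inner", "left", "full", "anti"]:
    print(how, join(users, views, "id", how))
```

<p>The inner join keeps only keys present in both tables, the left join keeps every row of the first table, the full join keeps everything, and the anti join keeps rows of the first table that have no partner in the second.</p>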
<p>See this <a href="https://github.com/jhofman/msd2017/blob/master/lectures/lecture_4/combine_and_reshape_in_r.ipynb">Jupyter notebook</a> for more details.
Additional readings include <a href="http://r4ds.had.co.nz/tidy-data.html">Chapter 12</a> of <a href="http://r4ds.had.co.nz/">R for Data Science</a> for <code class="highlighter-rouge">tidyr</code> and <a href="http://r4ds.had.co.nz/relational-data.html">Chapter 13</a> for joins.
There are also useful vignettes for <a href="https://cran.r-project.org/web/packages/dplyr/vignettes/two-table.html">two-table verbs</a> in <code class="highlighter-rouge">dplyr</code> and <a href="https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html">tidy data</a> with <code class="highlighter-rouge">tidyr</code>.</p>
<center>
<iframe src="//www.slideshare.net/slideshow/embed_code/key/7VTVGmJRVcQ1Ln" width="476" height="400" frameborder="0" marginwidth="0" marginheight="0" scrolling="no"></iframe>
</center>
<p>In the second half of class we talked about counting at scale with <a href="http://research.google.com/archive/mapreduce.html">MapReduce</a>.
At its core, MapReduce is a distributed system for solving the split/apply/combine problem at scale, essentially functioning as a distributed group-by operation.
The programmer implements a <code class="highlighter-rouge">map</code> function, which defines how records should be split into groups, and a <code class="highlighter-rouge">reduce</code> function, which defines what to compute within each group.
The system takes care of the rest of the complex engineering details, from distributed storage to fault tolerance, in a manner that makes the parallelism virtually transparent to the programmer.</p>
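<p>The division of labor can be sketched on a single machine in a few lines of Python.
This is only a toy simulation of the programming model for the familiar wordcount problem: the <code class="highlighter-rouge">run_mapreduce</code> driver is a hypothetical stand-in for everything a real system does (distribution, shuffling, fault tolerance).</p>

```python
from collections import defaultdict

def map_fn(line):
    """Emit a (key, value) pair per word: the key decides the group."""
    for word in line.lower().split():
        yield word, 1

def reduce_fn(word, counts):
    """Compute something within each group: here, the total count."""
    return word, sum(counts)

def run_mapreduce(records, map_fn, reduce_fn):
    """Single-process stand-in for the framework: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)   # the "shuffle" step
    return dict(reduce_fn(key, values) for key, values in groups.items())

lines = ["a rose is a rose", "is it"]
print(run_mapreduce(lines, map_fn, reduce_fn))
```

<p>The programmer only writes the first two functions; swapping in different ones changes what gets counted without touching the driver.</p>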
<p><a href="http://hadoop.apache.org/">Hadoop</a> is a popular open source implementation of the MapReduce paradigm.
We discussed how <a href="https://hadoop.apache.org/docs/r1.2.1/streaming.html">Hadoop Streaming</a> can be used to scale existing code, and briefly looked at higher-level languages that abstract away some low-level MapReduce details from the programmer.
For instance, <a href="http://pig.apache.org">Pig</a> is a high-level language that converts sequences of common data analysis operations (e.g., filter, sort, join, group by, etc.) to chains of MapReduce jobs and executes these either locally or across a Hadoop cluster.
<a href="http://hive.apache.org">Hive</a> is similar, but follows the <a href="https://en.wikipedia.org/wiki/SQL">SQL</a> paradigm more closely.</p>
<p>See this <a href="https://vgc.poly.edu/~juliana/courses/cs6093/Readings/dean-cacm2008.pdf">CACM article</a> and <a href="http://infolab.stanford.edu/~ullman/mmds/ch2.pdf">Chapter 2</a> of <a href="http://mmds.org/">Mining Massive Data Sets</a> for more on MapReduce.
Michael Noll also has a nice <a href="http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/">tutorial</a>.
And code for the wordcount example we covered in class is on the <a href="https://github.com/jhofman/msd2017/tree/master/lectures/lecture_4">course Github page</a>.</p>
Fri, 10 Feb 2017 10:10:00 +0000Homework 1
http://modelingsocialdata.org/homework/2017/02/10/homework-1.html
http://modelingsocialdata.org/homework/2017/02/10/homework-1.html<p>The first homework assignment, <a href="https://github.com/jhofman/msd2017/tree/master/homework/homework_1">posted on Github</a>, is due on Thursday, February 23 by 11:59pm ET.</p>
<p>The first problem explores various counting techniques, the second involves some command line and R counting exercises, and the third looks at the impact of inventory size on customer satisfaction for the MovieLens data.
Details are in the README.md file for each problem.</p>
<p>Your code and a brief report with your results are to be submitted electronically in one zipped (or tarball-ed) file through the <a href="https://courseworks2.columbia.edu/courses/sis_course_id:APMAE4990_001_2017_1">CourseWorks</a> site.
All code should be contained in plain text files and should produce the exact results you provide in your writeup.
Code should be written in bash / R and should not have complex dependencies on non-standard libraries.
The report should simply present your answers to the questions in an organized format as either a plain text or pdf file.
All work should be your own and done individually.</p>
Fri, 10 Feb 2017 08:00:00 +0000Lecture 3: Computational complexity
http://modelingsocialdata.org/lectures/2017/02/03/lecture-3-computational-complexity.html
http://modelingsocialdata.org/lectures/2017/02/03/lecture-3-computational-complexity.html<p>We had a guest lecture from <a href="http://sidsen.org/">Sid Sen</a> on computational complexity and algorithm analysis.</p>
<p><img src="http://modelingsocialdata.org/img/runtime_table.png" alt="Algorithm runtime in seconds, from Kleinberg & Tardos" /></p>
<p>Sid discussed various ways of analyzing how long algorithms take to run, focusing on worst-case analysis.
We discussed <a href="https://www.khanacademy.org/computing/computer-science/algorithms/asymptotic-notation/a/asymptotic-notation">asymptotic notation</a> (<a href="https://www.khanacademy.org/computing/computer-science/algorithms/asymptotic-notation/a/big-o-notation">big-O</a> for upper bounds, <a href="https://www.khanacademy.org/computing/computer-science/algorithms/asymptotic-notation/a/big-big-omega-notation">big-omega</a> for lower bounds, and <a href="https://www.khanacademy.org/computing/computer-science/algorithms/asymptotic-notation/a/big-big-theta-notation">big-theta</a> for tight bounds).
The table above, from <a href="https://www.pearsonhighered.com/program/Kleinberg-Algorithm-Design/PGM319216.html">Algorithm Design</a> by Kleinberg and Tardos, shows how long we should expect different algorithms to run on modern hardware.
The key takeaway is that knowing how to match the right algorithm to your dataset is important.
For instance, when you’re dealing with millions of observations, only linear (or maybe <a href="https://en.wikipedia.org/wiki/Time_complexity#Linearithmic_time">linearithmic</a>) time algorithms are practical.</p>
<p>A few other references:</p>
<ul>
<li>A <a href="https://rob-bell.net/2009/06/a-beginners-guide-to-big-o-notation/">beginner’s guide</a> to big-O notation</li>
<li>Another <a href="https://www.interviewcake.com/article/python/big-o-notation-time-and-space-complexity">introduction to big-O</a></li>
<li>The <a href="http://bigocheatsheet.com/">big-O cheatsheet</a></li>
</ul>
<p>Sid finished his lecture by discussing how this applies to something as simple as taking the intersection of two lists, useful for <a href="https://en.wikipedia.org/wiki/Join_(SQL)">joining</a> different tables.
A naive approach of comparing all pairs of elements takes quadratic time.
It’s relatively easy to do much better by <a href="https://en.wikipedia.org/wiki/Sort-merge_join">sorting and merging</a> the two sets, reducing this to <code class="highlighter-rouge">n log(n)</code> time.
And if we’re willing to trade space for time, we can use a <a href="https://en.wikipedia.org/wiki/Hash_table">hash table</a> to get the job done in linear time, known as a <a href="https://en.wikipedia.org/wiki/Hash_join">hash join</a>.</p>
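<p>The three approaches can be sketched as follows in Python, assuming for simplicity that neither list contains duplicates.
The function names are made up for this example.</p>

```python
def intersect_naive(a, b):
    """Compare all pairs: quadratic time."""
    return [x for x in a if any(x == y for y in b)]

def intersect_sort_merge(a, b):
    """Sort both lists, then walk them in lockstep: n log(n) time."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            i += 1
        elif a[i] > b[j]:
            j += 1
        else:
            out.append(a[i])
            i += 1
            j += 1
    return out

def intersect_hash(a, b):
    """Build a hash set from one list, probe with the other: linear expected
    time, at the cost of extra memory for the set."""
    seen = set(b)
    return [x for x in a if x in seen]

a = [5, 1, 3, 9]
b = [3, 4, 5, 6]
print(intersect_naive(a, b), intersect_sort_merge(a, b), intersect_hash(a, b))
```

<p>All three return the same elements (the sort-merge version in sorted order); what changes is how the running time grows with the size of the lists.</p>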
<center>
<iframe src="//www.slideshare.net/slideshow/embed_code/key/ejmirP42ECxx3f" width="476" height="400" frameborder="0" marginwidth="0" marginheight="0" scrolling="no" style="border:1px solid #CCC; border-width:1px; margin-bottom:5px; max-width: 100%;" allowfullscreen=""> </iframe>
</center>
<p>We used the second half of lecture to discuss <a href="https://github.com/jhofman/msd2017/blob/master/lectures/lecture_3/intro_to_r.ipynb">data manipulation in R</a>, specifically focusing on using <a href="https://github.com/hadley/dplyr"><code class="highlighter-rouge">dplyr</code></a> from the <a href="http://tidyverse.org"><code class="highlighter-rouge">tidyverse</code></a> for a convenient implementation of the split / apply / combine framework.</p>
<p>We started this part of the lecture with a brief tour of the <a href="http://www.rstudio.com">RStudio</a> IDE.
We then turned to <a href="http://had.co.nz">Hadley Wickham’s</a> <code class="highlighter-rouge">dplyr</code> package (<a href="http://cran.r-project.org/web/packages/dplyr/index.html">CRAN</a>, <a href="https://github.com/hadley/dplyr">GitHub</a>), which provides a particularly nice implementation of the <a href="http://bit.ly/splitapplycombine">split/apply/combine</a> paradigm.
Source code is available on the <a href="https://github.com/jhofman/msd2017/tree/master/lectures/lecture_3">course GitHub page</a>.</p>
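<p>For readers who think better in code, the split/apply/combine pattern itself fits in a few lines.
The sketch below is plain Python rather than <code class="highlighter-rouge">dplyr</code>, and the helper name, column names, and trip records are all invented for illustration.</p>

```python
from collections import defaultdict

def split_apply_combine(rows, key, value, apply):
    """Split rows into groups by `key`, apply a function to each group's
    `value` column, and combine the results into one dict."""
    groups = defaultdict(list)
    for row in rows:                       # split
        groups[row[key]].append(row[value])
    return {k: apply(v) for k, v in groups.items()}   # apply + combine

def mean(values):
    return sum(values) / len(values)

# Hypothetical trip records
trips = [
    {"station": "A", "minutes": 10},
    {"station": "B", "minutes": 30},
    {"station": "A", "minutes": 20},
]
print(split_apply_combine(trips, "station", "minutes", mean))
print(split_apply_combine(trips, "station", "minutes", len))
```

<p>Changing the grouping column or the function applied gives a different summary without changing the overall shape of the computation, which is roughly what grouping and summarizing in <code class="highlighter-rouge">dplyr</code> express.</p>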
<p>There are <a href="https://pinboard.in/u:jhofman/t:r/t:tutorials/">lots of R resources</a> available on the web, but here are a few highlights:</p>
<ul>
<li><a href="http://tryr.codeschool.com">CodeSchool</a> and <a href="https://www.datacamp.com/courses/free-introduction-to-r">DataCamp</a> intro to R courses</li>
<li>More about <a href="http://www.r-tutor.com/r-introduction/basic-data-types">basic types</a> (numeric, character, logical, factor) in R</li>
<li>Vectors, lists, dataframes: a <a href="http://www.statmethods.net/input/datatypes.html">one page reference</a></li>
<li>Chapters <a href="http://r4ds.had.co.nz/introduction.html">1</a>, <a href="http://r4ds.had.co.nz/explore-intro.html">2</a>, and <a href="http://r4ds.had.co.nz/transform.html">5</a> of <a href="http://r4ds.had.co.nz">R for Data Science</a></li>
<li>DataCamp’s <a href="https://campus.datacamp.com/courses/dplyr-data-manipulation-r-tutorial">Data Manipulation in R</a> tutorial</li>
<li>The <a href="http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html">dplyr vignette</a></li>
<li>Sean Anderson’s <a href="http://seananderson.ca/2014/09/13/dplyr-intro.html">dplyr and pipes examples</a> (<a href="https://github.com/seananderson/dplyr-intro-2014">code</a> on github)</li>
<li>RStudio’s <a href="http://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf">data wrangling cheatsheet</a></li>
<li>Hadley Wickham’s <a href="http://adv-r.had.co.nz/Style.html">R style guide</a></li>
</ul>
Fri, 03 Feb 2017 00:00:00 +0000Lecture 2: Introduction to Counting
http://modelingsocialdata.org/lectures/2017/01/27/lecture-2-counting.html
http://modelingsocialdata.org/lectures/2017/01/27/lecture-2-counting.html<p>Counting is surprisingly useful for understanding and summarizing social data. The key is figuring out what to count and how to count it efficiently.</p>
<center>
<iframe src="//www.slideshare.net/slideshow/embed_code/key/3O721xJmxzHLuh" width="476" height="400" frameborder="0" marginwidth="0" marginheight="0" scrolling="no" style="border:1px solid #CCC; border-width:1px; margin-bottom:5px; max-width: 100%;" allowfullscreen=""> </iframe>
</center>
<p>While it’s a seemingly simple concept, counting can be quite challenging in practice, especially when dealing with large, multi-dimensional data.</p>
<p>We discussed the <a href="http://bit.ly/splitapplycombine">split/apply/combine</a> paradigm for counting and applied it to several examples from <a href="http://5harad.com/papers/long_tail.pdf">The Anatomy of the Long Tail</a>.
We also looked at alternative models for counting that trade off flexibility for scalability, such as <a href="http://en.wikipedia.org/wiki/Streaming_algorithm">streaming algorithms</a>.
Streaming allows us to compute statistics such as the mean or <a href="http://www.johndcook.com/blog/standard_deviation/">variance</a> without having to read all of the data into memory first.
We summarized these approaches and compared the types of statistics that can be computed under various conditions.</p>
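<p>As an example of the streaming approach, the sketch below keeps a running mean and variance using Welford’s update, so each observation is seen once and then discarded.
The class name and the sample numbers are assumptions for illustration.</p>

```python
class RunningStats:
    """Running mean and sample variance in constant memory (Welford's method)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0   # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self):
        """Sample variance (divides by n - 1); 0.0 with fewer than two points."""
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

stats = RunningStats()
for x in [2, 4, 4, 4, 5, 5, 7, 9]:   # could just as well be lines read from a file
    stats.update(x)
print(stats.mean, stats.variance())
```

<p>Only three numbers are stored no matter how long the stream is. Statistics such as the median do not decompose this way, which is the flexibility we give up in exchange for scalability.</p>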
<p>We concluded with more work on the command line, including some simple counting and exploration of the <a href="https://www.citibikenyc.com/system-data">CitiBike trip data</a>.
Slides and code, including an “Introduction to the command line” <a href="https://github.com/jhofman/msd2017/blob/master/lectures/lecture_2/intro_command_line.ipynb">notebook</a>, are available on the <a href="https://github.com/jhofman/msd2017/tree/master/lectures/lecture_2/">course github page</a>.</p>
<p>Additional command line references can be found in the <a href="/homework/2017/01/20/installing-tools.html">installing tools</a> post.</p>
Fri, 27 Jan 2017 00:00:00 +0000Lecture 1: Overview
http://modelingsocialdata.org/lectures/2017/01/20/lecture-1-overview.html
http://modelingsocialdata.org/lectures/2017/01/20/lecture-1-overview.html<p>We used our first lecture to look at case studies in four main areas: exploratory data analysis, classification, regression, and working with network data.</p>
<center>
<iframe src="//www.slideshare.net/slideshow/embed_code/key/3OAsEKMJjyH2me" width="476" height="400" frameborder="0" marginwidth="0" marginheight="0" scrolling="no" style="border:1px solid #CCC; border-width:1px; margin-bottom:5px; max-width: 100%;" allowfullscreen=""> </iframe>
</center>
<p>We discussed a few examples, including using aggregate search activity to <a href="http://www.pnas.org/content/107/41/17486.full.pdf">predict consumer behavior</a>, exploring browsing logs to understand how Internet usage <a href="http://www.aaai.org/ocs/index.php/ICWSM/ICWSM12/paper/viewFile/4660/4975">varies across demographic groups</a>, and analyzing the structure of information cascades to understand <a href="https://5harad.com/papers/twiral.pdf">how content spreads online</a>.</p>
<p>During this discussion, we touched on how easy it is to find <a href="http://www.tylervigen.com">spurious correlations</a>, <a href="http://hunch.net/?p=22">cheat at prediction</a>, and <a href="http://www.amazon.com/gp/product/0393310728/">lie with statistics</a>.</p>
Fri, 20 Jan 2017 00:00:00 +0000Installing tools
http://modelingsocialdata.org/homework/2017/01/20/installing-tools.html
http://modelingsocialdata.org/homework/2017/01/20/installing-tools.html<p>This class will involve a good deal of coding, for which you will need some basic tools. Please make sure to set up the following tools after the first day of class.</p>
<h3 id="an-interactive-bash-shell">An interactive <a href="http://www.gnu.org/software/bash/">bash</a> shell</h3>
<p>This will give you the ability to interact with your filesystem via the command line instead of a GUI such as Windows Explorer or Mac Finder. We will also use bash to automate acquiring and cleaning data sets.</p>
<p>If you use Windows, you can try the <a href="http://www.howtogeek.com/249966/how-to-install-and-use-the-linux-bash-shell-on-windows-10/">built-in bash/Ubuntu</a> shell on Windows 10, or you can <a href="https://cygwin.com/install.html">install Cygwin</a>, which includes bash and a terminal application by default. Mac OS X includes a bash shell by default, and a terminal application in <code class="highlighter-rouge">/Applications/Utilities</code>. Linux also includes a working shell and terminal.</p>
<p>Verify that your environment is properly configured by typing the commands shown after the <code class="highlighter-rouge">#</code> prompt below. You should see output similar (although not necessarily identical) to the following:</p>
<div class="highlighter-rouge"><pre class="highlight"><code># echo $SHELL
/bin/bash
# grep --version
grep (BSD grep) 2.5.1-FreeBSD
# cut
usage: cut -b list [-n] [file ...]
cut -c list [file ...]
cut -f list [-s] [-d delim] [file ...]
</code></pre>
</div>
<p>If you’re new to the command line, see Codecademy’s <a href="https://www.codecademy.com/courses/learn-the-command-line/lessons/navigation/exercises/your-first-command?action=lesson_resume">interactive tutorial</a>, this <a href="https://learnpythonthehardway.org/book/appendixa.html">crash course</a>, and Software Carpentry’s <a href="http://swcarpentry.github.io/shell-novice/">guide</a>.
Lifehacker’s <a href="http://lifehacker.com/5633909/who-needs-a-mouse-learn-to-use-the-command-line-for-almost-anything">command line primer</a> is also decent.</p>
<p>O’Reilly’s <a href="http://shop.oreilly.com/product/9780596005955.do">Classic Shell Scripting</a> book is a more complete reference.</p>
<h3 id="a-git-client">A <a href="http://git-scm.com">Git</a> client</h3>
<p>Git is a version control system that allows you to track modifications to files and code over time. It also facilitates collaborations so that multiple people can share and edit the same code base.</p>
<p>If you are on Windows, you can install <a href="https://windows.github.com">Github for Windows</a>, which provides both the command line tool for git and a graphical user interface. Alternatively, you can install git as an optional package under Cygwin. We recommend the Github application, as it will be easier to interface with Github using it. Likewise, modern versions of Mac OS X have a command line git client installed by default, but the <a href="https://mac.github.com">Github for Mac</a> tool is a recommended addition. Linux users can install git with the appropriate package manager (e.g., <code class="highlighter-rouge">yum install git</code> on RedHat or <code class="highlighter-rouge">apt-get install git</code> on Debian or Ubuntu), and there are a number of different <a href="http://unix.stackexchange.com/questions/144100/is-there-a-usable-gui-front-end-to-git-on-linux">git GUIs for Linux</a>.</p>
<p>Complete this relatively brief <a href="https://www.codeschool.com/courses/try-git">interactive tour of git</a>. See this <a href="http://rogerdudler.github.io/git-guide/">one page guide</a> for explanations of the usual git workflow and most common commands, or <a href="http://kbroman.org/github_tutorial/">here</a> for a more verbose guide. Github also has an <a href="https://www.youtube.com/watch?v=U8GBXvdmHT4">introductory video</a>, some <a href="https://services.github.com/training/">training courses</a>, and a handy <a href="https://services.github.com/resources/">cheatsheet</a>.</p>
<h3 id="a-github-account">A <a href="http://github.com">Github</a> account</h3>
<p>Github is a platform that facilitates collaboration on projects that use git. You can use it to host projects, publish them to the web, and share them with other people. <a href="https://help.github.com/articles/signing-up-for-a-new-github-account/">Create a free account</a> if you don’t already have one.</p>
<p>Once you have an account, clone the <a href="https://github.com/jhofman/msd2017">course repository</a> using your local git client. This is most easily done on the command line as follows:</p>
<div class="highlighter-rouge"><pre class="highlight"><code># git clone https://github.com/jhofman/msd2017.git
Cloning into 'msd2017'...
remote: Counting objects: 145, done.
remote: Compressing objects: 100% (98/98), done.
remote: Total 145 (delta 40), reused 137 (delta 37)
Receiving objects: 100% (145/145), 454.90 KiB | 594.00 KiB/s, done.
Resolving deltas: 100% (40/40), done.
Checking connectivity... done.
</code></pre>
</div>
<p>When this is complete, verify that you have a local directory called <code class="highlighter-rouge">msd2017</code> containing a <code class="highlighter-rouge">README.md</code> file.</p>
<h3 id="r-and-rstudio">R and RStudio</h3>
<p>R is a useful programming language for exploratory data analysis, data visualization, and statistical modeling. RStudio is a popular integrated development environment (IDE) for working in R.</p>
<p>First, download and install R from a <a href="https://cloud.r-project.org/">CRAN mirror</a>. Then download RStudio from <a href="https://www.rstudio.com/products/rstudio/download/">here</a>. Finally, install and load some important packages as follows:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>install.packages('tidyverse')
library(tidyverse)
</code></pre>
</div>
<p>If you’re new to R, see the <a href="http://tryr.codeschool.com/">Code School</a> and <a href="http://datacamp.com/courses/free-introduction-to-r">DataCamp</a> online tutorials.</p>
<p>We will discuss all of these tools in more detail in class.</p>
Fri, 20 Jan 2017 00:00:00 +0000