http://modelingsocialdata.org/
Wed, 24 Jul 2019 14:14:52 +0000
Final project reports
http://modelingsocialdata.org/lectures/2019/05/17/final-project-reports-2019.html
http://modelingsocialdata.org/lectures/2019/05/17/final-project-reports-2019.html<p>Below is a list of the <a href="/final-projects">final projects</a> for the Spring 2019 semester, including a link to the original paper, the students’ final report, and all code and data necessary to reproduce the final report.</p>
<table>
<thead>
<tr>
<th>Group</th>
<th>Original paper</th>
<th>Replication report</th>
<th>Code</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td><a href="https://www.sciencedirect.com/science/article/pii/S0165176599002499">Wage disparity and team productivity: evidence from Major League Baseball</a>, Craig A. Depken II, <em>Economics Letters</em> (2000)</td>
<td><a href="https://github.com/msd2019/msd2019-final-project-final-project-group-1/blob/master/03_final_report.pdf">Rmarkdown pdf</a></td>
<td><a href="https://github.com/msd2019/msd2019-final-project-final-project-group-1">Github repository</a></td>
</tr>
<tr>
<td>2</td>
<td><a href="https://www.cambridge.org/core/journals/american-political-science-review/article/ethnicity-insurgency-and-civil-war/B1D5D0E7C782483C5D7E102A61AD6605">Ethnicity, Insurgency, and Civil War</a>, Fearon & Laitin, <em>American Political Science Review</em> (2003)</td>
<td><a href="https://github.com/msd2019/msd2019-final-project-final-project-group-2/blob/master/05_final_report.pdf">Rmarkdown pdf</a></td>
<td><a href="https://github.com/msd2019/msd2019-final-project-final-project-group-2">Github repository</a></td>
</tr>
<tr>
<td>3</td>
<td><a href="https://www.cambridge.org/core/journals/american-political-science-review/article/ethnicity-insurgency-and-civil-war/B1D5D0E7C782483C5D7E102A61AD6605">Greed and Grievance in Civil War</a>, Collier & Hoeffler, <em>Oxford Economic Papers</em> (2004)</td>
<td><a href="https://github.com/msd2019/msd2019-final-project-group-3/blob/master/02_final_report.pdf">Rmarkdown pdf</a></td>
<td><a href="https://github.com/msd2019/msd2019-final-project-group-3">Github repository</a></td>
</tr>
<tr>
<td>4</td>
<td><a href="https://dl.acm.org/citation.cfm?id=1772756">Predicting Positive and Negative Links in Online Social Networks</a>, Leskovec et al., <em>World Wide Web Conference</em> (2010)</td>
<td><a href="https://github.com/msd2019/msd2019-final-project-final-project-group-4/blob/master/05_final_report.pdf">Rmarkdown pdf</a></td>
<td><a href="https://github.com/msd2019/msd2019-final-project-final-project-group-4">Github repository</a></td>
</tr>
<tr>
<td>5</td>
<td><a href="https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1475-4932.2012.00809.x">Predicting the Present with Google Trends</a>, Choi & Varian, <em>Economic Record</em> (2012)</td>
<td><a href="https://github.com/msd2019/msd2019-final-project-group-5/blob/master/08_final_report.pdf">Rmarkdown pdf</a></td>
<td><a href="https://github.com/msd2019/msd2019-final-project-group-5">Github repository</a></td>
</tr>
<tr>
<td>6</td>
<td><a href="https://www.cambridge.org/core/journals/political-analysis/article/comparing-random-forest-with-logistic-regression-for-predicting-classimbalanced-civil-war-onset-data/109E1511378A38BB4B41F721E6017FB1">Comparing random forest with logistic regression for predicting class-imbalanced civil war onset data</a>, Muchlinski et al., <em>Political Analysis</em> (2016)</td>
<td><a href="https://github.com/msd2019/msd2019-final-project-final-project-group-6/blob/master/06_Analysis_And_Comparison.pdf">Rmarkdown pdf</a></td>
<td><a href="https://github.com/msd2019/msd2019-final-project-final-project-group-6">Github repository</a></td>
</tr>
<tr>
<td>7</td>
<td><a href="https://www.aeaweb.org/articles?id=10.1257/pol.1.1.75">Housing, Health, and Happiness</a>, Cattaneo et al., <em>American Economic Journal: Economic Policy</em> (2009)</td>
<td><a href="https://github.com/msd2019/msd2019-final-project-group-7/blob/master/05_final_report.pdf">Rmarkdown pdf</a></td>
<td><a href="https://github.com/msd2019/msd2019-final-project-group-7">Github repository</a></td>
</tr>
<tr>
<td>8</td>
<td><a href="https://www.aaai.org/ocs/index.php/ICWSM/ICWSM17/paper/viewPaper/15665">Automated Hate Speech Detection and the Problem of Offensive Language</a>, Davidson et al., <em>International Conference on Weblogs and Social Media</em> (2017)</td>
<td><a href="https://github.com/msd2019/msd2019-final-project-group-8/blob/master/05_final_report.pdf">Jupyter pdf</a></td>
<td><a href="https://github.com/msd2019/msd2019-final-project-group-8">Github repository</a></td>
</tr>
<tr>
<td>9</td>
<td><a href="https://advances.sciencemag.org/content/1/1/e1400005">Systematic Inequality and Hierarchy in Faculty Hiring Networks</a>, Clauset et al., <em>Science Advances</em> (2015)</td>
<td><a href="https://github.com/msd2019/msd2019-final-project-final-project-group-9/blob/master/02_final_report.pdf">Rmarkdown pdf</a></td>
<td><a href="https://github.com/msd2019/msd2019-final-project-final-project-group-9">Github repository</a></td>
</tr>
<tr>
<td>10</td>
<td><a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2769645">Chilling Effects: Online Surveillance and Wikipedia Use</a>, Penney, <em>Berkeley Technology Law Journal</em> (2016)</td>
<td><a href="https://github.com/msd2019/msd2019-final-project-group-10/blob/master/02_final_report.pdf">Rmarkdown pdf</a></td>
<td><a href="https://github.com/msd2019/msd2019-final-project-group-10">Github repository</a></td>
</tr>
</tbody>
</table>
Fri, 17 May 2019 10:00:00 +0000
Lecture 12: Causality & Experiments
http://modelingsocialdata.org/lectures/2019/04/26/lecture-12-causality-and-experiments.html
http://modelingsocialdata.org/lectures/2019/04/26/lecture-12-causality-and-experiments.html<p>In this lecture we discussed causal inference, randomized experiments, and natural experiments.</p>
<center>
<script async="" class="speakerdeck-embed" data-id="f50c15e459d24ecbbe2380a40316a08f" data-ratio="1.77777777777778" src="//speakerdeck.com/assets/embed.js"></script>
</center>
<p>Most of what we’ve discussed in this class has focused on observational data—data obtained without direct intervention from or manipulation by those studying it.
We can learn a lot from observational data and use it to find interesting relationships, build predictive models, or even to generate hypotheses, but it has its limits.
This is often summarized by catchy phrases such as <a href="http://freakonomics.com/2009/06/30/so-long-and-thanks-for-all-the-f-tests/">“correlation is not causation”</a> or <a href="http://freakonomics.com/2009/06/30/so-long-and-thanks-for-all-the-f-tests/">“no causation without manipulation”</a>.</p>
<p>We opened this discussion by comparing two scenarios: (a) making a forecast about a static world with (b) trying to predict what happens when you change something in the world.
For the former you might do well by simply recognizing correlations (e.g., seeing my neighbor with an umbrella might predict rain), but the latter requires a more robust model of the world (e.g., handing my neighbor an umbrella is unlikely to cause rain).
We discussed the idea of trying to estimate the “effects of causes”, touching on the <a href="https://en.wikipedia.org/wiki/Rubin_causal_model">potential outcomes</a> and <a href="https://en.wikipedia.org/wiki/Causal_graph">causal graphical model</a> frameworks.</p>
<p>Using the effect of hospitalization on health as an example, we talked about confounding factors that complicate causal inference.
For instance, my health today might affect both whether I go to the hospital and my health tomorrow, making it difficult to isolate the effect of hospitalization on health from other factors.
We saw this mathematized in what Varian calls the “basic identity of causal inference”: observational estimates conflate the average treatment effect with selection bias, where selection bias measures the baseline difference between those who opted into treatment and those who didn’t.
We also discussed <a href="http://en.wikipedia.org/wiki/Simpson's_paradox">Simpson’s paradox</a>, where selection bias is so large that it leads to a directionally incorrect estimate of a causal effect: what appears to be a positive correlation without adjusting for possible confounds can in fact become a negative one when all available information is accounted for.</p>
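<p>As a rough illustration of how Simpson’s paradox can arise, the sketch below compares two treatments within two subgroups and then pooled across them. The counts are an assumption for illustration, patterned on the often-cited kidney-stone example; they were not part of the lecture.</p>

```python
# Illustrative (successes, patients) counts per subgroup and treatment.
# These numbers are assumptions for the example, not data from the lecture.
outcomes = {
    "small stones": {"A": (81, 87), "B": (234, 270)},
    "large stones": {"A": (192, 263), "B": (55, 80)},
}


def rate(successes, total):
    """Fraction of patients with a successful outcome."""
    return successes / total


def pooled_rate(treatment):
    """Success rate for a treatment, ignoring the subgroup entirely."""
    successes = sum(outcomes[group][treatment][0] for group in outcomes)
    total = sum(outcomes[group][treatment][1] for group in outcomes)
    return rate(successes, total)


for group, by_treatment in outcomes.items():
    a = rate(*by_treatment["A"])
    b = rate(*by_treatment["B"])
    print(f"{group}: A={a:.3f}, B={b:.3f}")

print(f"pooled: A={pooled_rate('A'):.3f}, B={pooled_rate('B'):.3f}")
```

<p>Treatment A does better within each subgroup, yet B looks better once the subgroups are pooled, because A was given mostly to the harder (large-stone) cases.</p>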
<p>We then introduced counterfactuals and <a href="http://en.wikipedia.org/wiki/Randomized_experiment">randomized experiments</a>.
The question you’d really like to answer is this: if you cloned each person and sent one copy of that person to the hospital, but not the other, what would the resulting difference in health be?
Short of being able to do this, we could ask a slightly different question: if we had two groups of people who were nearly identical in every way and we sent one group to the hospital, but not the other, how would the health of the two groups differ?
This is precisely the idea behind randomized experiments, such as <a href="http://en.wikipedia.org/wiki/Clinical_trial">clinical trials</a> in medicine and <a href="http://en.wikipedia.org/wiki/A/B_testing">A/B testing</a> for online platforms.
Randomization is key here, as it provides a way of creating two groups that are as similar as possible prior to the treatment (e.g., hospitalization) being administered: if people are randomly assigned to groups, then there shouldn’t be any systematic difference between the two groups, eliminating selection bias.
Since the only difference between the groups is that one gets treated and the other doesn’t, we can ascribe differences in the outcome to the treatment.</p>
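<p>The argument above can be sketched with a small simulation. Everything here is a made-up toy model, not the lecture’s example: baseline health is drawn from a standard normal, hospitalization adds a fixed <code class="highlighter-rouge">true_effect</code>, and in the observational world sicker people are assumed to opt in more often.</p>

```python
import random


def mean(values):
    return sum(values) / len(values)


def simulate(n=20000, true_effect=1.0, seed=0):
    """Return (observational difference, randomized difference) in mean outcomes."""
    rng = random.Random(seed)
    obs_treated, obs_control = [], []
    rct_treated, rct_control = [], []
    for _ in range(n):
        health = rng.gauss(0, 1)      # baseline health, the confounder
        y0 = health                   # potential outcome without hospitalization
        y1 = health + true_effect     # potential outcome with hospitalization

        # Observational world (assumed): sicker people are more likely to go.
        p_go = 0.8 if health < 0 else 0.2
        if rng.random() < p_go:
            obs_treated.append(y1)
        else:
            obs_control.append(y0)

        # Randomized world: a coin flip decides, independent of health.
        if rng.random() < 0.5:
            rct_treated.append(y1)
        else:
            rct_control.append(y0)

    obs_diff = mean(obs_treated) - mean(obs_control)
    rct_diff = mean(rct_treated) - mean(rct_control)
    return obs_diff, rct_diff


obs_diff, rct_diff = simulate()
print(f"observational estimate: {obs_diff:.2f}")
print(f"randomized estimate:    {rct_diff:.2f}")
```

<p>Under these assumptions the observational comparison is pulled far below the true effect by selection bias, while the randomized comparison lands close to it.</p>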
<p>While randomized experiments are the “gold standard” for causal inference, we discussed some caveats and limitations in traditional approaches to experimentation in the social sciences, covering issues of both “internal” and “external” validity.
The first asks whether the experiment was properly designed to isolate the intended effect, whereas the second asks if we should expect the results of the study to generalize to other scenarios.</p>
<p>We discussed natural experiments as an alternative, where the idea is to exploit naturally occurring variation to tease out causal effects from observational data. We followed <a href="http://www.thaddunning.com/wp-content/uploads/2009/12/Dunning_IEPS_InstrumentalVariables2.pdf">Dunning’s treatment of instrumental variables</a> (IV) by looking at randomized experiments with non-compliance, where there’s a difference between assignment to treatment (e.g., whether you’re told to take a drug) versus receipt of treatment (e.g., whether you actually take it).
The basic idea is that we can estimate two separate quantities: the effect of being assigned a treatment and the probability of actually complying with that assignment.
Dividing the former by the latter provides an estimate of the causal effect of actually receiving the treatment.
Furthermore, we can extend this analysis to situations in which nature provides the randomization instead of a researcher flipping a coin, in which case the source of randomness is referred to as an “instrument” that systematically shifts the probability of being treated.
Classic examples include lotteries or weather events.
<!-- We briefly looked an example of the latter in a recent paper that uses random variations in weather to study [peer effects of exercise](https://www.nature.com/articles/ncomms14753) in social networks. -->
We finished with a brief discussion of regression discontinuity and difference-in-difference designs as well.</p>
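<p>The ratio described above (the effect of assignment divided by the compliance rate) can be sketched as follows. The function name <code class="highlighter-rouge">wald_estimate</code> and the eight-person data set are hypothetical, chosen so the arithmetic is easy to check by hand.</p>

```python
def mean(values):
    return sum(values) / len(values)


def wald_estimate(assigned, received, outcome):
    """Effect of assignment on the outcome divided by its effect on receipt."""
    y_assigned = [y for z, y in zip(assigned, outcome) if z == 1]
    y_control = [y for z, y in zip(assigned, outcome) if z == 0]
    d_assigned = [d for z, d in zip(assigned, received) if z == 1]
    d_control = [d for z, d in zip(assigned, received) if z == 0]

    effect_of_assignment = mean(y_assigned) - mean(y_control)
    compliance_rate = mean(d_assigned) - mean(d_control)
    return effect_of_assignment / compliance_rate


# Hypothetical data: half of those assigned to treatment actually take it,
# and taking it raises the outcome from 1 to 3.
assigned = [1, 1, 1, 1, 0, 0, 0, 0]
received = [1, 1, 0, 0, 0, 0, 0, 0]
outcome = [3, 3, 1, 1, 1, 1, 1, 1]

print(wald_estimate(assigned, received, outcome))
```

<p>Here assignment raises the average outcome by 1, but only half of those assigned comply, so the estimated effect of actually receiving the treatment is 1 / 0.5 = 2.</p>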
<p>References:</p>
<ul>
<li><a href="http://bayes.cs.ucla.edu/WHY/">The Book of Why</a> by Pearl and Mackenzie</li>
<li><a href="https://ftp.cs.ucla.edu/pub/stat_ser/r414.pdf">Understanding Simpson’s Paradox</a> by Judea Pearl</li>
<li>Chapter 21 of <a href="http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/">Advanced Data Analysis from an Elementary Point of View</a></li>
<li>Chapters 1 and 2 of <a href="http://isps.yale.edu/FEDAI">Field Experiments: Design, Analysis, and Interpretation</a></li>
<li>Matt Blackwell’s lecture notes on <a href="http://www.mattblackwell.org/files/teaching/s03-potential.pdf">causality and potential outcomes</a> as well as <a href="http://www.mattblackwell.org/files/teaching/s04-experiments.pdf">randomized experiments</a></li>
<li>Some notes on <a href="http://andrewgelman.com/2007/12/08/causal_inferenc_2/">causal inference</a> from Andrew Gelman</li>
<li><a href="https://www.nature.com/articles/ncomms13800">Resilient cooperators stabilize long-run cooperation in the finitely repeated Prisoner’s Dilemma</a> by Mao, Dworkin, Suri & Watts</li>
<li><a href="http://www.pnas.org/content/113/27/7310.full.pdf">Causal inference in economics and marketing</a> by Hal Varian</li>
<li><a href="http://www.thaddunning.com/wp-content/uploads/2009/12/Dunning_IEPS_InstrumentalVariables2.pdf">Instrumental Variables</a> by Thad Dunning (followup <a href="http://www.thaddunning.com/wp-content/uploads/2009/12/Dunning-PA.pdf">here</a>)</li>
<li>See Chapter 5 of <a href="http://www.cambridge.org/gb/academic/subjects/politics-international-relations/research-methods-politics/natural-experiments-social-sciences-design-based-approach">Natural Experiments in the Social Sciences</a> by Dunning for more detail</li>
<li><a href="https://www.nature.com/articles/ncomms14753">Exercise contagion in a global social network</a> by Aral & Nicolaides</li>
</ul>
Fri, 26 Apr 2019 10:00:00 +0000
Lecture 11: Networks II
http://modelingsocialdata.org/lectures/2019/04/12/lecture-11-networks-2.html
http://modelingsocialdata.org/lectures/2019/04/12/lecture-11-networks-2.html<p>We spent this lecture discussing representations and characteristics of networks and algorithms for analyzing network data.</p>
<p>After discussing many different types of networks that we might analyze as well as the various levels of abstraction available for representing them, we turned to algorithms for efficiently computing shortest path lengths, connected components, mutual friends, and clustering coefficients.</p>
<p>We started with the problem of finding the shortest distance between a single source node and all other nodes in an (undirected, unweighted) network, as measured by the fewest number of edges you need to traverse to get from the source to every other node.
(Every researcher’s favorite version of this is computing their <a href="http://en.wikipedia.org/wiki/Erdős_number">Erdős number</a>, the academic take on the more well-known <a href="http://en.wikipedia.org/wiki/Six_Degrees_of_Kevin_Bacon">Kevin Bacon game</a>. Compute yours <a href="http://academic.research.microsoft.com">here</a>.)</p>
<p>Breadth first search (BFS) provides a nice solution.
The intuition behind BFS is simple: we start from the source node and mark it as distance zero from itself.
Then we visit each of its neighbors and mark those as distance one.
We repeat this iteratively, pushing forward a boundary of recently discovered nodes that are one additional hop from the source at each step.
BFS visits each node and edge in a network once, scaling linearly in the size of the network.
If, however, we would like to find the shortest distance between <em>all pairs</em> of nodes then we must repeat this for each possible source node, and so this quickly becomes prohibitively expensive for even moderately sized networks.
(See <a href="http://en.wikipedia.org/wiki/Shortest_path_problem#All-pairs_shortest_paths">here</a> for fancier, more efficient algorithms.)</p>
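<p>A minimal sketch of the BFS procedure described above, assuming the network is stored as a dictionary mapping each node to a list of its neighbors (a representation chosen for this example; the lecture’s own code is in R with <code class="highlighter-rouge">igraph</code>).</p>

```python
from collections import deque


def bfs_distances(graph, source):
    """Hop distance from source to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:          # first visit is via a shortest path
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


# Small undirected example; "f" is isolated and therefore unreachable from "a".
graph = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["a", "d"],
    "d": ["b", "c", "e"],
    "e": ["d"],
    "f": [],
}

print(bfs_distances(graph, "a"))
```

<p>Nodes that never appear in the returned dictionary are unreachable from the source.</p>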
<p>Next we looked at using BFS for a related problem: finding the number of <a href="http://en.wikipedia.org/wiki/Connected_component_(graph_theory)">connected components</a>, or separate pieces, of a network.
We did this by simply looping over our shortest path code, seeding it on each iteration with a currently unreachable node as the source until we reach all nodes.
We gave the reachable nodes in each BFS a unique label corresponding to its component.</p>
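<p>One way to sketch this labeling loop, again assuming a dictionary-of-neighbor-lists representation that is specific to this example:</p>

```python
from collections import deque


def connected_components(graph):
    """Label every node with a component id by running BFS from unlabeled nodes."""
    label = {}
    n_components = 0
    for source in graph:
        if source in label:
            continue                          # already reached by an earlier BFS
        label[source] = n_components
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbor in graph[node]:
                if neighbor not in label:
                    label[neighbor] = n_components
                    queue.append(neighbor)
        n_components += 1
    return label, n_components


# Three separate pieces: {1, 2, 3}, {4, 5}, and the isolated node 6.
graph = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4], 6: []}

labels, count = connected_components(graph)
print(count, labels)
```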
<p>Then we moved on to computing the number of friends that any two nodes have in common, motivated by the problem of friend recommendations on social networks.
The underlying idea can be traced back to <a href="https://sociology.stanford.edu/sites/default/files/publications/the_strength_of_weak_ties_and_exch_w-gans.pdf">Granovetter</a>: two people are likely to know each other if they have many mutual friends.
To compute the number of mutual friends between all pairs of nodes, we exploit the fact that the neighbors of every node share that node as a common friend.
To count all mutual friends we simply loop over each node and increment a counter for every pair of its neighbors.
For each node this scales as the square of its degree, so the whole algorithm scales as the sum of the squared degrees of all nodes.
This can quickly become expensive if we have even a few high-degree nodes, which are quite common in practice.</p>
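<p>The counting loop can be sketched as below. The graph and the function name <code class="highlighter-rouge">mutual_friend_counts</code> are illustrative assumptions; pairs are stored in sorted order so each unordered pair gets a single counter.</p>

```python
from collections import Counter
from itertools import combinations


def mutual_friend_counts(graph):
    """Number of common neighbors for every pair of nodes that shares at least one."""
    counts = Counter()
    for node, neighbors in graph.items():
        # Every pair of this node's neighbors has this node as a mutual friend.
        for u, v in combinations(sorted(neighbors), 2):
            counts[(u, v)] += 1
    return counts


# A square a-b-c-d with one diagonal (a-c).
graph = {
    "a": ["b", "c", "d"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["a", "c"],
}

for pair, n_mutual in sorted(mutual_friend_counts(graph).items()):
    print(pair, n_mutual)
```

<p>The inner loop runs over roughly degree-squared pairs per node, which is why a handful of very high-degree nodes can dominate the running time.</p>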
<p>Finally, we looked at the closely related problem of counting the number of triangles around each node in a network.
This algorithm is nearly identical to computing mutual friends, as we generate the same set of two-hop paths through all pairs of a node’s neighbors, but simply increment different counters to generate different results.
Instead of accumulating mutual friends for each pair of a node’s neighbors, we ask whether every pair of neighbors are themselves directly connected.
If so, we count this as (half of) a triangle in which the node participates.
Dividing the number of closed triangles in a network by the number of possible triangles that could be present gives a useful measure of how <a href="http://en.wikipedia.org/wiki/Clustering_coefficient">clustered</a> a network is.</p>
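<p>A sketch of triangle counting and the resulting clustering measure, under the same dictionary-of-neighbors assumption. This version loops over unordered pairs of neighbors, so each closed pair counts as a whole triangle for the node rather than half of one.</p>

```python
from itertools import combinations


def triangles_per_node(graph):
    """Number of triangles each node participates in."""
    neighbor_sets = {node: set(neighbors) for node, neighbors in graph.items()}
    triangles = {}
    for node, neighbors in neighbor_sets.items():
        # A pair of neighbors closes a triangle if the two are directly connected.
        triangles[node] = sum(
            1 for u, v in combinations(neighbors, 2) if v in neighbor_sets[u]
        )
    return triangles


def global_clustering(graph):
    """Closed two-hop paths divided by all two-hop paths."""
    # Each triangle is seen once from each of its three corners.
    closed = sum(triangles_per_node(graph).values())
    possible = sum(len(nbrs) * (len(nbrs) - 1) // 2 for nbrs in graph.values())
    return closed / possible if possible else 0.0


# A square a-b-c-d with one diagonal (a-c): two triangles in total.
graph = {
    "a": ["b", "c", "d"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["a", "c"],
}

print(triangles_per_node(graph))
print(global_clustering(graph))
```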
<p>To better understand properties of networks and how to compute them, we looked at a few example networks in R using the <code class="highlighter-rouge">igraph</code> package.
See the <a href="https://github.com/jhofman/msd2019/tree/master/lectures/lecture_11">notebooks</a> on the course GitHub page for related code and data used in the lectures.</p>
<p>We finished lecture with a preview of issues related to causal inference.</p>
<p>References:</p>
<ul>
<li>Chapters 2, 18, and 20 of Easley and Kleinberg’s <a href="http://www.cs.cornell.edu/home/kleinber/networks-book/">Networks, Crowds, and Markets</a></li>
<li><a href="https://www.math.cornell.edu/m/sites/default/files/imported/People/strogatz/nature_smallworld.pdf">Collective dynamics of ‘small-world’ networks</a> by Watts & Strogatz</li>
<li><a href="http://web.stanford.edu/~jugander/papers/websci12-fourdegrees.pdf">Four degrees of separation</a>: scaling up calculations to the entire Facebook social graph</li>
<li><a href="http://www.rebennack.net/SEA2011/files/talks/SEA2011_Pajor.pdf">Customizable route planning</a>: how shortest path calculations are done in modern mapping applications</li>
<li>These <a href="https://berkeleydatascience.files.wordpress.com/2012/03/20120320berkeley.pdf">slides</a> on the early system for friend recommendation on Facebook (pages 28 to 37)</li>
<li><a href="http://theory.stanford.edu/~sergei/papers/www11-triangles.pdf">The Curse of the Last Reducer</a></li>
<li><a href="http://theory.stanford.edu/~sergei/papers/soda10-mrc.pdf">A Model of Computation for MapReduce</a></li>
</ul>
<!--
BFS computes shortest path: http://www.cs.toronto.edu/~krueger/cscB63h/lectures/BFS.pdf
BFS runtime and correctness: http://www.cse.ust.hk/faculty/golin/COMP271Sp03/Notes/MyL06.ps
[MapReduce for networks](http://jakehofman.com/icwsm2010/slides.html)
https://github.com/jhofman/icwsm2010_tutorial
[Facebook at scale](http://arxiv.org/abs/1111.4503)
-->
Fri, 12 Apr 2019 10:00:00 +0000
Homework 4
http://modelingsocialdata.org/homework/2019/04/10/homework-4.html
http://modelingsocialdata.org/homework/2019/04/10/homework-4.html<p>The fourth homework assignment, <a href="https://github.com/jhofman/msd2019/tree/master/homework/homework_4">posted on Github</a>, is due on Thursday, April 25 by 11:59pm ET.</p>
<p>The first problem explores the small-world phenomenon in “close” vs. “distant” friend networks, the second studies how the structure of an email network changes as we remove weak ties from it, and the third looks at gender assortativity in networks. Details are in the README.md file for each problem.</p>
<p>Your code and results are to be submitted electronically in one zipped (or tarball-ed) file through the <a href="https://courseworks2.columbia.edu/courses/77738">CourseWorks</a> site. All solutions should be contained in the corresponding files provided here. Code should be written in bash / R and should not have complex dependencies on non-standard libraries. Each problem contains an Rmarkdown file which should be rendered as a pdf to be submitted with your solution. All work should be your own and done individually.</p>
Wed, 10 Apr 2019 10:00:00 +0000
Lecture 10: Networks I
http://modelingsocialdata.org/lectures/2019/04/05/lecture-10-networks-1.html
http://modelingsocialdata.org/lectures/2019/04/05/lecture-10-networks-1.html<p>We used this lecture to first go through applications of logistic regression and then to discuss the history of network science.</p>
<!-- We spent this lecture discussing network data, including a whirlwhind tour of the history of network theory, representations and characteristics of networks, and algorithms for analyzing network data. -->
<center>
<script async="" class="speakerdeck-embed" data-id="7848c1385ff346709bae389edb62613d" data-ratio="1.33333333333333" src="//speakerdeck.com/assets/embed.js"></script>
</center>
<p>We started off this lecture by revisiting logistic regression, looking at the problem of modeling which passengers <a href="https://www.kaggle.com/c/titanic">survived the Titanic disaster</a>. We saw that interpreting logistic regression results can be challenging, as coefficients give information about changes in log-odds (as opposed to probabilities directly). We stressed the idea of converting back to probabilities and visually comparing predicted and actual values for a range of feature values to better understand the model fit. See <a href="http://htmlpreview.github.io/?https://github.com/jhofman/msd2019/blob/master/lectures/lecture_10/interpreting_logistic_regression.html">this notebook</a> for details.</p>
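<p>To make the log-odds point concrete, here is a small sketch that converts a linear predictor back to a probability. The intercepts and the coefficient are hypothetical values picked for illustration; they are not fitted to the Titanic data or taken from the notebook.</p>

```python
import math


def logistic(z):
    """Convert log-odds to a probability."""
    return 1 / (1 + math.exp(-z))


# Hypothetical coefficient on a binary feature (an assumption, not a fitted value).
coef = 2.5

# The same coefficient applied at two different hypothetical baselines.
for intercept in (-1.0, -4.0):
    p_without = logistic(intercept)
    p_with = logistic(intercept + coef)
    print(
        f"baseline log-odds {intercept}: "
        f"p = {p_without:.3f} without the feature, {p_with:.3f} with it"
    )

# The odds ratio implied by the coefficient is the same in both cases.
print(f"odds ratio: {math.exp(coef):.2f}")
```

<p>The coefficient multiplies the odds by the same factor regardless of baseline, but the change in probability depends heavily on where you start, which is one reason plotting predicted probabilities is more informative than reading coefficients alone.</p>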
<p>Next we discussed <a href="https://github.com/VowpalWabbit/vowpal_wabbit/wiki">Vowpal Wabbit</a> (VW), an open source tool for various machine learning tasks. VW has many attractive features, such as a flexible input format, speed, scalability, and sensible defaults. For binary classification, VW defaults to fitting a (clipped) linear model to minimize squared loss. We looked at an example of <a href="https://github.com/VowpalWabbit/vowpal_wabbit/wiki/Rcv1-example">classifying news</a> with VW to get a sense of the interface and performance, which is quite competitive.</p>
<p>Then we moved on to a history of network science.</p>
<p>We talked about some of the earliest studies of networks, such as Jacob Moreno’s <a href="https://timesmachine.nytimes.com/timesmachine/1933/04/03/99218765.html?action=click&contentCollection=Archives&module=LedeAsset&region=ArchiveBody&pgtype=article&pageNumber=17">sociograms</a> and Mark Granovetter’s work on the <a href="https://sociology.stanford.edu/sites/default/files/publications/the_strength_of_weak_ties_and_exch_w-gans.pdf">strength of weak ties</a>. We contrasted theoretical models of graphs (e.g., <a href="http://en.wikipedia.org/wiki/Erdős–Rényi_model">Erdős–Rényi</a> random graphs) to real-world networks, which tend to have highly <a href="http://en.wikipedia.org/wiki/Complex_network#Scale-free_networks">skewed degree distributions</a> as originally discussed in Derek de Solla Price’s studies of <a href="http://garfield.library.upenn.edu/papers/pricenetworks1965.pdf">citation networks</a>. At the same time, social networks typically have <a href="http://en.wikipedia.org/wiki/Small-world_network">short path lengths</a>, in the sense that one needs only to traverse a handful of links to connect a randomly selected set of people in the network.</p>
<p>We finished by discussing different types of networks that we might analyze as well as the various levels of abstraction available for representing them.</p>
<p>More on networks next time.</p>
<p>References:</p>
<ul>
<li>The <a href="https://github.com/VowpalWabbit/vowpal_wabbit/wiki">Vowpal Wabbit Wiki</a></li>
<li>Chapters 2, 18, and 20 of Easley and Kleinberg’s <a href="http://www.cs.cornell.edu/home/kleinber/networks-book/">Networks, Crowds, and Markets</a></li>
<li>Granovetter’s <a href="https://sociology.stanford.edu/sites/default/files/publications/the_strength_of_weak_ties_and_exch_w-gans.pdf">Strength of Weak Ties</a> paper</li>
<li>de Solla Price on <a href="http://garfield.library.upenn.edu/papers/pricenetworks1965.pdf">citation networks</a> and <a href="http://garfield.library.upenn.edu/price/pricetheory1976.pdf">cumulative advantage</a></li>
<li>Milgram’s original <a href="https://en.wikipedia.org/wiki/Small-world_experiment">small world experiment</a></li>
<li><a href="https://www.math.cornell.edu/m/sites/default/files/imported/People/strogatz/nature_smallworld.pdf">Collective dynamics of ‘small-world’ networks</a> by Watts & Strogatz</li>
</ul>
<!--
* [Four degrees of separation](http://web.stanford.edu/~jugander/papers/websci12-fourdegrees.pdf): scaling up calculations to the entire Facebook social graph
* [Customizable route planning](http://www.rebennack.net/SEA2011/files/talks/SEA2011_Pajor.pdf): how shortest path calculations are done in modern mapping applications
* These [slides](https://berkeleydatascience.files.wordpress.com/2012/03/20120320berkeley.pdf) on the early system for friend recommendation on Facebook (pages 28 to 37)
-->
<!--
BFS computes shortest path: http://www.cs.toronto.edu/~krueger/cscB63h/lectures/BFS.pdf
BFS runtime and correctness: http://www.cse.ust.hk/faculty/golin/COMP271Sp03/Notes/MyL06.ps
[MapReduce for networks](http://jakehofman.com/icwsm2010/slides.html)
https://github.com/jhofman/icwsm2010_tutorial
[Curse of the last reducer](http://theory.stanford.edu/~sergei/papers/www11-triangles.pdf)
[Model of MapReduce](http://theory.stanford.edu/~sergei/papers/soda10-mrc.pdf)
[Facebook at scale](http://arxiv.org/abs/1111.4503)
-->
Fri, 05 Apr 2019 10:00:00 +0000
Lecture 9: Classification
http://modelingsocialdata.org/lectures/2019/03/29/lecture-9-classification.html
http://modelingsocialdata.org/lectures/2019/03/29/lecture-9-classification.html<p>In this lecture we covered classification with linear models, specifically naive Bayes and logistic regression.</p>
<center>
<script async="" class="speakerdeck-embed" data-id="46903fe715de4ab59c254c6a61ea866d" data-ratio="1.33333333333333" src="//speakerdeck.com/assets/embed.js"></script>
</center>
<p>We started this lecture by introducing the problem of classification and how it differs from regression: the outcome is categorical (e.g., whether an email is spam or <a href="https://wiki.apache.org/spamassassin/Ham">ham</a>) rather than continuous.
We first reviewed <a href="http://en.wikipedia.org/wiki/Bayes'_rule">Bayes’ rule</a> for inverting conditional probabilities via a simple, but <a href="http://bit.ly/ggbbc">somewhat counterintuitive</a>, <a href="http://www.scientificamerican.com/article/what-is-bayess-theorem-an/">medical diagnosis example</a> and then adapted this to an (extremely naive) <a href="https://github.com/jhofman/msd2019/blob/master/lectures/lecture_9/enron_naive_bayes.sh">one-word spam classifier</a>.
We improved upon this by considering all words present in a document and arrived at naive Bayes—a simple linear method for classification in which we model each word occurrence independently and use Bayes’ rule to calculate the probability the document belongs to each class given the words it contains.</p>
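<p>The inversion that Bayes’ rule performs can be sketched with a diagnosis-style calculation. The prevalence, sensitivity, and false positive rate below are hypothetical numbers chosen for illustration, not those used in the lecture or the linked articles.</p>

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(condition | positive test) via Bayes' rule."""
    true_positives = sensitivity * prior
    false_positives = false_positive_rate * (1 - prior)
    return true_positives / (true_positives + false_positives)


# Hypothetical test: 1% prevalence, 90% sensitivity, 5% false positive rate.
p = posterior(prior=0.01, sensitivity=0.9, false_positive_rate=0.05)
print(f"P(condition | positive test) = {p:.3f}")
```

<p>Even with a fairly accurate test, most positives come from the much larger healthy group, so the posterior stays well below the test’s sensitivity. A one-word spam classifier has the same structure, with “word appears” in place of “test is positive”.</p>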
<p>Although naive Bayes makes an obviously incorrect assumption that all features are independent, it turns out to be a reasonably useful method in practice.
It’s simple and scalable to train, easy to update as new data arrive, easy to interpret, and often more competitive in performance than one might expect.
That said, there are some obvious issues with naive Bayes as presented, namely overfitting in the training process and overconfidence / miscalibration when making predictions.</p>
<p>The first issue arises when thinking about how to estimate word probabilities.
Simple maximum likelihood estimates (MLE) for word probabilities lead to overfitting, implying, for instance, that it’s impossible to see a word in a given class in the future if we’ve never seen it occur in that class in the past.
We dealt with this by thinking about maximum a posteriori (MAP) estimation which led to the idea of <a href="https://en.wikipedia.org/wiki/Additive_smoothing">Laplace smoothing</a>, or adding <a href="http://en.wikipedia.org/wiki/Pseudocount">pseudocounts</a> to empirical word counts to prevent overfitting.
As usual, determining the amount of smoothing to use is an empirical question, often solved by methods such as cross-validation.</p>
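<p>A minimal sketch of naive Bayes with pseudocounts, assuming a word-count (multinomial) model and a four-document toy corpus invented for this example. The function names and the <code class="highlighter-rouge">alpha</code> parameter are illustrative, not taken from the course code.</p>

```python
import math
from collections import Counter


def train(docs, alpha=1.0):
    """Fit smoothed log-probabilities; docs is a list of (label, words) pairs.

    alpha is the pseudocount added to every word count and must be positive
    here, since unsmoothed counts of zero would require log(0).
    """
    vocab = {word for _, words in docs for word in words}
    class_counts = Counter(label for label, _ in docs)
    word_counts = {label: Counter() for label in class_counts}
    for label, words in docs:
        word_counts[label].update(words)

    model = {}
    for label in class_counts:
        total = sum(word_counts[label].values())
        log_prior = math.log(class_counts[label] / len(docs))
        log_likelihood = {
            word: math.log(
                (word_counts[label][word] + alpha) / (total + alpha * len(vocab))
            )
            for word in vocab
        }
        model[label] = (log_prior, log_likelihood)
    return model


def predict(model, words):
    """Return the class with the highest posterior log-score."""
    scores = {}
    for label, (log_prior, log_likelihood) in model.items():
        # Words outside the training vocabulary are ignored.
        scores[label] = log_prior + sum(
            log_likelihood[w] for w in words if w in log_likelihood
        )
    return max(scores, key=scores.get)


# Toy corpus (an assumption for illustration).
docs = [
    ("spam", ["win", "money", "now"]),
    ("spam", ["win", "prize"]),
    ("ham", ["meeting", "now"]),
    ("ham", ["lunch", "meeting", "tomorrow"]),
]

model = train(docs, alpha=1.0)
print(predict(model, ["win", "money"]))
print(predict(model, ["meeting", "money"]))
```

<p>The word “money” never occurs in the ham documents, so a maximum likelihood estimate would rule out ham for any message containing it; with pseudocounts the second message can still be classified as ham on the strength of “meeting”.</p>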
<p>As for the second problem, which stems from the assumption of feature independence, we addressed this by abandoning naive Bayes in favor of logistic regression.
Logistic regression makes predictions using the same functional form as naive Bayes—the log-odds are modeled as a weighted combination of feature values—but fits these weights in a manner that accounts for correlations between features.
We (once again) applied the maximum likelihood principle to arrive at criteria for estimating these weights, and discussed gradient descent for a solution. The resulting algorithms are very close in spirit to those for linear regression, but slightly more complex due to the logistic function.
And, similar to linear regression, we discussed the idea of regularizing logistic regression by including a term in the loss function to penalize large weight vectors.</p>
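<p>The fitting procedure can be sketched as plain batch gradient descent on the average log-loss plus an L2 penalty. The learning rate, penalty strength, iteration count, and data are all assumptions for this example, and for simplicity the penalty is applied to every weight, including the intercept.</p>

```python
import math


def sigmoid(z):
    return 1 / (1 + math.exp(-z))


def fit_logistic(X, y, learning_rate=0.1, l2=0.01, n_iter=2000):
    """Batch gradient descent for L2-regularized logistic regression."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(n_iter):
        # Gradient of the penalty term (l2 / 2) * ||w||^2.
        grad = [l2 * wj for wj in w]
        # Gradient of the average log-loss: mean of (prediction - label) * x.
        for xi, yi in zip(X, y):
            error = sigmoid(sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j in range(d):
                grad[j] += error * xi[j] / n
        w = [wj - learning_rate * gj for wj, gj in zip(w, grad)]
    return w


# Toy data: first column is a constant 1 for the intercept.
X = [[1, -2.0], [1, -1.0], [1, -0.5], [1, 0.5], [1, 1.0], [1, 2.0]]
y = [0, 0, 0, 1, 1, 1]

w = fit_logistic(X, y)
predictions = [
    int(sigmoid(sum(wj * xj for wj, xj in zip(w, xi))) > 0.5) for xi in X
]
print(w, predictions)
```

<p>These toy classes are perfectly separable, so without the penalty the weights would keep growing; the L2 term keeps them finite.</p>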
<p>We concluded with a discussion of several metrics for evaluating classifiers, including calibration, <a href="https://en.wikipedia.org/wiki/Confusion_matrix">confusion matrices</a>, accuracy, precision and recall, and the <a href="https://en.wikipedia.org/wiki/Receiver_operating_characteristic">ROC curve</a>. See the <a href="https://github.com/jhofman/msd2019/blob/master/lectures/lecture_9/classification.ipynb">classification notebook</a> up on Github for more details.</p>
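<p>For reference, a short sketch of how the confusion-matrix counts relate to accuracy, precision, and recall. The labels and predictions are invented for illustration, and the function names are not from the course notebook.</p>

```python
def confusion_counts(actual, predicted):
    """Return (tp, fp, fn, tn) for binary labels coded as 0/1."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    return tp, fp, fn, tn


def classification_metrics(actual, predicted):
    tp, fp, fn, tn = confusion_counts(actual, predicted)
    return {
        "accuracy": (tp + tn) / len(actual),
        # Of the items predicted positive, what fraction really are positive?
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        # Of the truly positive items, what fraction did we find?
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


actual = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0]

print(confusion_counts(actual, predicted))
print(classification_metrics(actual, predicted))
```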
<p>A few references:</p>
<ul>
<li>Chapter 12 of <a href="http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/">Advanced Data Analysis from an Elementary Point of View</a></li>
<li>Chapter 4 of <a href="http://www-bcf.usc.edu/~gareth/ISL/getbook.html">An Introduction to Statistical Learning</a></li>
<li><a href="http://www.cs.iastate.edu/~honavar/bayes-lewis.pdf">Naive Bayes at 40</a> by Lewis (1998)</li>
<li><a href="http://www.jstor.org/pss/1403452">Idiot’s Bayes—Not So Stupid After All?</a> by Hand and Yu (2001)</li>
<li><a href="http://robotics.stanford.edu/users/sahami/papers-dir/spam.pdf">A Bayesian Approach to Filtering Junk E-mail</a> from Sahami, Dumais, Heckerman, and Horvitz (1998)</li>
<li><a href="http://www.paulgraham.com/spam.html">A Plan for Spam</a> by Paul Graham (2002)</li>
<li><a href="https://ccrma.stanford.edu/workshops/mir2009/references/ROCintro.pdf">An introduction to ROC analysis</a></li>
<li><a href="http://www.navan.name/roc/">Understanding ROC curves</a></li>
<li><a href="https://github.com/VowpalWabbit/vowpal_wabbit/wiki">Vowpal Wabbit</a> for scalable classification</li>
</ul>
Fri, 29 Mar 2019 10:00:00 +0000
Homework 3
http://modelingsocialdata.org/homework/2019/03/29/homework-3.html
http://modelingsocialdata.org/homework/2019/03/29/homework-3.html<p>The third homework assignment, <a href="https://github.com/jhofman/msd2019/tree/master/homework/homework_3">posted on Github</a>, is due on Thursday, April 11 by 11:59pm ET.</p>
<p>The first problem explores various modeling scenarios, the second looks at cross-validation for polynomial regression, and in the third you’ll use regularized logistic regression to classify New York Times articles. Details are in the README.md file for each problem.</p>
<p>Your code and results are to be submitted electronically in one zipped (or tarball-ed) file through the <a href="https://courseworks2.columbia.edu/courses/77738">CourseWorks</a> site. All solutions should be contained in the corresponding files provided here. Code should be written in bash / R and should not have complex dependencies on non-standard libraries. Each problem contains an Rmarkdown file which should be rendered as a pdf to be submitted with your solution. All work should be your own and done individually.</p>
Fri, 29 Mar 2019 10:00:00 +0000Lecture 8: Regression, Part 2
http://modelingsocialdata.org/lectures/2019/03/15/lecture-8-regression-2.html
http://modelingsocialdata.org/lectures/2019/03/15/lecture-8-regression-2.html<p>This was the second lecture on the theory and practice of regression, focused on model complexity and generalization.</p>
<center>
<script async="" class="speakerdeck-embed" data-id="8b16d5652bae434e8d478f70bcce6724" data-ratio="1.33333333333333" src="//speakerdeck.com/assets/embed.js"></script>
</center>
<p>We started with an applied modeling problem: understanding how internet browsing activity varies by age and gender. We saw that there’s a lot more to modeling than just optimization, with many important steps along the way that range from collecting and specifying outcomes and predictors, to determining the form of a model, to assessing performance and interpreting results. We found that including quadratic terms for age and interacting gender with age gave a reasonable model, at least in terms of visually matching empirical aggregates. See the <a href="https://github.com/jhofman/msd2019/blob/master/lectures/lecture_7/linear_models.ipynb">linear models</a> notebook up on Github for more details.</p>
<p>Then we talked about two high-level points.
First, quantifying model fit and second, knowing when to stop fitting.
In this case, that translates to asking “how good is a quadratic fit” and “why shouldn’t I use a cubic, or quartic, etc.?” or “should I add another interaction?”</p>
<p>To the first point, we discussed <a href="https://en.wikipedia.org/wiki/Root-mean-square_deviation">root mean squared error (RMSE)</a> and the <a href="https://en.wikipedia.org/wiki/Coefficient_of_determination">coefficient of determination (<script type="math/tex">R^2</script>)</a> as sensible metrics of model fit.
RMSE is the square root of the mean of the squared loss we discussed last time; taking the square root puts the error in the same units as the outcome we’re trying to predict.
It’s useful when we already have a sense of absolute scale for “what’s good”.
The coefficient of determination, on the other hand, captures the fraction of variance in outcomes explained by the model, and is useful when we don’t have such a scale or are comparing across different problems.
We showed that this is the same as comparing the mean squared error (MSE) of the model to the MSE of a simple baseline where we always predict the average outcome.
Finally, we discussed the connection between <a href="https://en.wikipedia.org/wiki/Pearson_correlation_coefficient">Pearson’s correlation coefficient</a> and <script type="math/tex">R^2</script>.
See <a href="https://economictheoryblog.com/2014/11/05/proof/">here</a> for a proof that the latter is in fact the square of the former. See the <a href="https://github.com/jhofman/msd2019/blob/master/lectures/lecture_8/model_evaluation.ipynb">model evaluation</a> notebook on Github for details.</p>
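<p>To make these definitions concrete, here is a small Python sketch on hypothetical toy data (not the pageview dataset). It fits a one-predictor least squares line, computes RMSE, computes <script type="math/tex">R^2</script> by comparing the model’s MSE to that of the predict-the-mean baseline, and checks that the result matches the squared Pearson correlation. All function names are illustrative.</p>

```python
import math


def mean(values):
    return sum(values) / len(values)


def mse(actual, predicted):
    """Mean squared error between observed and predicted outcomes."""
    return mean([(a - p) ** 2 for a, p in zip(actual, predicted)])


def rmse(actual, predicted):
    """Root mean squared error, in the same units as the outcome."""
    return math.sqrt(mse(actual, predicted))


def r_squared(actual, predicted):
    """One minus the ratio of model MSE to the MSE of always predicting the mean."""
    baseline = [mean(actual)] * len(actual)
    return 1 - mse(actual, predicted) / mse(actual, baseline)


def pearson(x, y):
    """Pearson correlation coefficient between x and y."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def fit_line(x, y):
    """Ordinary least squares intercept and slope for a single predictor."""
    mx, my = mean(x), mean(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return my - slope * mx, slope


# Hypothetical toy data
x = [1, 2, 3, 4, 5, 6]
y = [2.3, 3.8, 6.1, 8.4, 9.6, 12.2]
intercept, slope = fit_line(x, y)
predicted = [intercept + slope * xi for xi in x]

print("RMSE:", rmse(y, predicted))
print("R^2:", r_squared(y, predicted))
print("squared correlation:", pearson(x, y) ** 2)
```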
<p>Applying both of these metrics to the pageview dataset, we saw that while there were systematic trends in typical viewing behavior by age and gender, there was still a surprisingly large amount of variation in individual activity for people of the same age and gender.</p>
<p>This led us to our second high-level topic, the question of complexity control: How complicated should we make our model?
We discussed the idea of generalization error, and how we’d like models that are both complex enough to account for the past and simple enough to predict the future.
Cross-validation is the most common approach to navigating this tradeoff, where we divide our data into a training set for fitting models, a validation set for comparing these different fits, and a test set that’s used once (and <em>only once</em>) to quote the expected future performance of the model we end up selecting.
We talked about <a href="https://www.youtube.com/watch?v=TIgfjmp-4BA">k-fold cross-validation</a> as a more statistically robust version of estimating generalization error. See the <a href="https://github.com/jhofman/msd2019/blob/master/lectures/lecture_8/complexity_Control.ipynb">complexity control</a> notebook on Github for details.</p>
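<p>The k-fold procedure can be sketched in a few lines of Python. This is a simplified illustration on simulated data rather than the code from the course notebook: it compares only two candidate models (a constant and a straight line), and the helper names are hypothetical.</p>

```python
import math
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k nearly equal folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def fit_mean(x, y):
    """Simplest candidate model: always predict the training mean."""
    mean_y = sum(y) / len(y)
    return lambda new_x: mean_y


def fit_line(x, y):
    """More flexible candidate: least squares line with one predictor."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
             / sum((a - mean_x) ** 2 for a in x))
    intercept = mean_y - slope * mean_x
    return lambda new_x: intercept + slope * new_x


def cross_validated_rmse(fit, x, y, k=5, seed=0):
    """Average held-out RMSE across k folds."""
    fold_rmses = []
    for held_out in k_fold_indices(len(x), k, seed):
        held = set(held_out)
        train = [i for i in range(len(x)) if i not in held]
        model = fit([x[i] for i in train], [y[i] for i in train])
        squared_errors = [(y[i] - model(x[i])) ** 2 for i in held_out]
        fold_rmses.append(math.sqrt(sum(squared_errors) / len(squared_errors)))
    return sum(fold_rmses) / k


# Simulated data: a linear trend plus Gaussian noise
rng = random.Random(1)
x = [float(i) for i in range(20)]
y = [2 * xi + rng.gauss(0, 1) for xi in x]

folds = k_fold_indices(len(x), 5)
cv_mean = cross_validated_rmse(fit_mean, x, y)
cv_line = cross_validated_rmse(fit_line, x, y)
print("constant model:", cv_mean, "linear model:", cv_line)
```

<p>The same loop extends to comparing polynomial degrees or interaction terms: each candidate is fit on the training folds and scored only on data it has not seen.</p>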
<p>We also phrased this issue in terms of the <a href="http://scott.fortmann-roe.com/docs/BiasVariance.html">bias-variance tradeoff</a>.
Simple models are likely biased in that they systematically misrepresent the world, and would do so even with an infinite amount of data.
At the same time, estimating a simple model is a low variance procedure in that our results don’t change substantially when we fit it on different samples of data.
More flexible models, on the other hand, have little bias and can capture more complex patterns in the world.
The downside is that this flexibility also renders such models sensitive to noise, often leading to high variance, or drastically different results with different samples of the data.</p>
<p>We concluded lecture with a brief discussion of <a href="https://en.wikipedia.org/wiki/Regularization_%28mathematics%29">regularization</a> as a way of modifying loss functions to improve the generalization error of our models by explicitly balancing the fit to the training data with the “complexity” of the model.
The idea is that introducing some bias in our models is sometimes a good idea if the corresponding reduction in variance is enough to lower the mean squared error.</p>
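<p>As one concrete case, consider a single-weight model with no intercept and an L2 (ridge) penalty, which is our choice of illustration rather than necessarily the form used in lecture. Minimizing <script type="math/tex">\sum_i (y_i - w x_i)^2 + \lambda w^2</script> gives <script type="math/tex">w = \sum_i x_i y_i / (\sum_i x_i^2 + \lambda)</script>, so larger penalties shrink the weight toward zero. A minimal Python sketch with hypothetical data:</p>

```python
def ridge_weight(x, y, penalty):
    """Closed-form minimizer of sum((y - w*x)^2) + penalty * w^2 for one weight."""
    return sum(a * b for a, b in zip(x, y)) / (sum(a * a for a in x) + penalty)


# Hypothetical toy data generated exactly by y = 2x
x = [1, 2, 3]
y = [2, 4, 6]

for penalty in [0, 1, 14, 100]:
    # The unpenalized fit recovers w = 2; increasing the penalty shrinks it
    print(penalty, ridge_weight(x, y, penalty))
```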
<p>See Github for an <a href="https://github.com/jhofman/msd2019/blob/master/lectures/lecture_8/intro_to_glmnet.ipynb">introduction to glmnet</a> as well as this interactive Shiny App to explore <a href="https://jmhmsr.shinyapps.io/regularization/">regularization</a>.</p>
<p>References:</p>
<ul>
<li>Chapter 2 of <a href="http://www-bcf.usc.edu/~gareth/ISL/">An Introduction to Statistical Learning</a> on the bias-variance tradeoff</li>
<li>Section 1.4 of <a href="http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/">Advanced Data Analysis from an Elementary Point of View</a> on the same, with a more detailed derivation
<!-- http://www.inf.ed.ac.uk/teaching/courses/mlsc/Notes/Lecture4/BiasVariance.pdf --></li>
<li>Chapter 5 of <a href="http://www-bcf.usc.edu/~gareth/ISL/">An Introduction to Statistical Learning</a> and 3 of <a href="http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/">Advanced Data Analysis from an Elementary Point of View</a> on resampling and cross-validation</li>
<li>Recent work on using differentially private mechanisms for <a href="https://research.googleblog.com/2015/08/the-reusable-holdout-preserving.html">reusing holdout sets</a></li>
<li>Chapters <a href="http://r4ds.had.co.nz/model-basics.html">23</a> and <a href="http://r4ds.had.co.nz/model-building.html">24</a> of <a href="http://r4ds.had.co.nz">R for Data Science</a></li>
<li>The <a href="https://modelr.tidyverse.org">modelr</a> and <a href="https://github.com/tidymodels/tidymodels">tidymodels</a> packages in R</li>
<li>The <a href="https://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html">glmnet vignette</a></li>
</ul>
Fri, 15 Mar 2019 10:00:00 +0000Lecture 7: Regression, Part 1
http://modelingsocialdata.org/lectures/2019/03/08/lecture-7-regression-1.html
http://modelingsocialdata.org/lectures/2019/03/08/lecture-7-regression-1.html<p>This was the first of two lectures on the theory and practice of regression.</p>
<center>
<script async="" class="speakerdeck-embed" data-id="199594cffb524787a7bced446593789a" data-ratio="1.33333333333333" src="//speakerdeck.com/assets/embed.js"></script>
</center>
<p>In the first part of class we shifted from talking about problems in how science is often done to best practices for doing good science. We went through the pipeline of designing a study, piloting and revising it, doing a power calculation, pre-registering the study, running it, creating a reproducible analysis and report, and thinking critically about the results.</p>
<p>Next we moved on to regression.
We started with a high-level overview of regression, which can be broadly defined as any analysis of how one continuous variable (the “outcome”) changes with others (the “inputs”, “predictors”, or “features”).
The goals of a regression analysis can vary, from describing the data at hand, to predicting new outcomes, to explaining the associations between outcomes and predictors.
This includes everything from looking at histograms and scatter plots to building statistical models.</p>
<p>We focused on the latter and discussed ordinary least squares regression.
First, we motivated this as an optimization problem and then connected squared loss minimization to the more general principle of maximum likelihood.
Then we discussed several ways to solve this optimization problem to estimate coefficients for a linear model, which are summarized in the table below.</p>
<table>
<thead>
<tr>
<th>Method</th>
<th style="text-align: center">Space</th>
<th style="text-align: center">Time</th>
<th>Comments</th>
</tr>
</thead>
<tbody>
<tr>
<td>Invert normal equations</td>
<td style="text-align: center"><script type="math/tex">N K + K^2</script></td>
<td style="text-align: center"><script type="math/tex">K^3</script></td>
<td>Good for medium-sized datasets with a relatively small number (e.g., hundreds or thousands) of features</td>
</tr>
<tr>
<td>Gradient descent</td>
<td style="text-align: center"><script type="math/tex">N K</script></td>
<td style="text-align: center"><script type="math/tex">NK</script> per step</td>
<td>Good for larger datasets that still fit in memory but have more features (e.g., millions); requires tuning learning rate</td>
</tr>
<tr>
<td>Stochastic gradient descent</td>
<td style="text-align: center"><script type="math/tex">K</script></td>
<td style="text-align: center"><script type="math/tex">K</script> per step</td>
<td>Good for datasets that exceed available memory; more sensitive to learning rate schedule</td>
</tr>
</tbody>
</table>
<p>See also this interactive Shiny App to explore <a href="https://jmhmsr.shinyapps.io/modelfit/">manually fitting a simple model</a> and this notebook by Jongbin Jung with <a href="http://jakehofman.com/gd/">an animation of gradient descent</a>.</p>
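<p>To compare the three approaches from the table, here is a deliberately simplified Python sketch that assumes a single feature and no intercept, so the normal equations reduce to a scalar formula. The data are simulated, and the learning rates and step counts are illustrative guesses rather than recommendations.</p>

```python
import random


def closed_form(x, y):
    """Normal-equation solution, which for one weight is sum(xy) / sum(x^2)."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)


def gradient_descent(x, y, learning_rate=0.5, steps=200):
    """Batch gradient descent on mean squared loss; each step touches all N points."""
    w = 0.0
    n = len(x)
    for _ in range(steps):
        gradient = -2 / n * sum(a * (b - w * a) for a, b in zip(x, y))
        w -= learning_rate * gradient
    return w


def stochastic_gradient_descent(x, y, initial_rate=0.5, steps=5000, seed=0):
    """SGD: each step uses one randomly chosen point and a decaying learning rate."""
    rng = random.Random(seed)
    w = 0.0
    for t in range(steps):
        i = rng.randrange(len(x))
        gradient = -2 * x[i] * (y[i] - w * x[i])
        w -= initial_rate / (1 + t / 100) * gradient
    return w


# Simulated data: y = 3x plus a little Gaussian noise
rng = random.Random(42)
x = [rng.random() for _ in range(200)]
y = [3 * a + rng.gauss(0, 0.1) for a in x]

w_exact = closed_form(x, y)
w_gd = gradient_descent(x, y)
w_sgd = stochastic_gradient_descent(x, y)
print(w_exact, w_gd, w_sgd)
```

<p>Batch gradient descent should land essentially on the closed-form answer, while the stochastic version hovers near it, with the gap depending on the learning rate schedule.</p>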
<!--
In the second half of class we looked at fitting linear models in R, with an application to understanding how internet browsing activity varies by age and gender.
See the [Jupyter notebook](https://github.com/jhofman/msd2017/blob/master/lectures/lecture_6/linear_models.ipynb) up on Github for more details.
The main lesson here is that there's more to modeling than just optimization, with many important steps along the way that range from collecting and specifying outcomes and predictors, to determining the form of a model, to assessing performance and interpreting results.
-->
<p>References:</p>
<ul>
<li>Chapter 3 of <a href="http://www-bcf.usc.edu/~gareth/ISL/">An Introduction to Statistical Learning</a></li>
<li>Chapters 1 and 2 of <a href="http://www.stat.cmu.edu/%7Ecshalizi/ADAfaEPoV/">Advanced Data Analysis from an Elementary Point of View</a></li>
<li>Chapter 5 of OpenIntro’s <a href="https://www.openintro.org/stat/textbook.php">Introductory Statistics with Randomization and Simulation</a></li>
<li><a href="http://www.cambridge.org/us/academic/subjects/statistics-probability/statistical-theory-and-methods/statistical-models-theory-and-practice-2nd-edition?format=PB">Statistical Models</a> by David Freedman</li>
<li><a href="https://us.sagepub.com/en-us/nam/regression-analysis/book226138">Regression Analysis</a> by Richard Berk</li>
<li>Chapters <a href="http://r4ds.had.co.nz/model-basics.html">23</a> and <a href="http://r4ds.had.co.nz/model-building.html">24</a> of <a href="http://r4ds.had.co.nz">R for Data Science</a></li>
</ul>
Fri, 08 Mar 2019 10:00:00 +0000Lecture 6: Reproducibility and replication, Part 2
http://modelingsocialdata.org/lectures/2019/03/01/lecture-6-reproducibility-2.html
http://modelingsocialdata.org/lectures/2019/03/01/lecture-6-reproducibility-2.html<p>This was our second lecture on reproducibility and replication in which we discussed false discoveries, effect sizes, and p-hacking / researcher degrees of freedom.</p>
<center>
<script async="" class="speakerdeck-embed" data-id="ce73cc7b18114447b75619411419bd76" data-ratio="1.33333333333333" src="//speakerdeck.com/assets/embed.js"></script>
</center>
<p>The <a href="/lectures/2019/02/22/lecture-5-reproducibility-1.html">previous lecture</a> provided a high-level overview of the ongoing replication crisis in the sciences. In this lecture we continued the discussion, first by talking about false discoveries. Following Felix Schönbrodt’s excellent <a href="http://www.nicebread.de/whats-the-probability-that-a-significant-p-value-indicates-a-true-effect/">blog post</a>, we talked about how underpowered studies lead to false discoveries. Then we went on to discuss <a href="https://transparentstats.github.io/guidelines/effectsize.html">effect sizes</a>, specifically <a href="https://en.wikipedia.org/wiki/Effect_size#Cohen's_d">Cohen’s d</a> and the <a href="https://en.wikipedia.org/wiki/Effect_size#Common_language_effect_size">AUC</a>, through this excellent <a href="https://rpsychologist.com/d3/cohend/">visual tool</a>.</p>
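<p>Both ideas reduce to short calculations, sketched below in Python. The first function computes the share of significant results that reflect true effects, given an assumed prior fraction of true hypotheses, the study’s power, and the significance level. The other two compute Cohen’s d with a pooled standard deviation and convert it to an AUC, assuming two normal distributions with equal variance. The input numbers are hypothetical.</p>

```python
import math
from statistics import NormalDist, mean, variance


def positive_predictive_value(prior, power, alpha=0.05):
    """Fraction of significant results that correspond to true effects."""
    true_positives = power * prior
    false_positives = alpha * (1 - prior)
    return true_positives / (true_positives + false_positives)


def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_variance = ((n_a - 1) * variance(group_a)
                       + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_variance)


def auc_from_d(d):
    """Probability that a random draw from one group exceeds a random draw from
    the other, assuming normal distributions with equal variance."""
    return NormalDist().cdf(d / math.sqrt(2))


# Hypothetical scenario: half of tested hypotheses are true
for power in [0.2, 0.5, 0.8]:
    print("power", power, "PPV", positive_predictive_value(0.5, power))

# Hypothetical measurements from two groups
d = cohens_d([2, 4, 6], [1, 3, 5])
print("Cohen's d:", d, "AUC:", auc_from_d(d))
```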
<p>Next we spoke about <a href="https://en.wikipedia.org/wiki/Post_hoc_analysis">post-hoc data analysis</a> and <a href="https://en.wikipedia.org/wiki/Data_dredging">p-hacking</a>. We looked at the <a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=1850704">False-Positive Psychology</a> paper by Simmons, Nelson & Simonsohn, which has an illustrative example of how one can arrive at nonsensical conclusions if there’s enough flexibility in data collection and analysis. Gelman and Loken’s <a href="http://www.stat.columbia.edu/~gelman/research/unpublished/p_hacking.pdf">The Garden of Forking Paths</a> makes a similar point, noting that this can often occur without ill intent on the part of the researcher. While these issues are complex, there are a few best practices (e.g., running pilot studies followed by <a href="https://aspredicted.org">pre-registration</a> of high-powered, large-scale experiments) that can help mitigate these concerns.
<a href="http://www.sciencemag.org/careers/2015/12/register-your-study-new-publication-option">Registered reports</a> are a particularly attractive solution, wherein researchers write up and submit an experimental study for peer review <em>before</em> the study is conducted. Reviewers make an acceptance decision at this point based on the merit of the study, and, if accepted, it is published regardless of the results. We also discussed how these ideas that come largely from randomized experiments might be adapted for observational studies.</p>
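<p>To see how analytic flexibility inflates false positives, consider a stylized Python simulation. It assumes a researcher runs several independent tests on data with no real effect and reports a finding if any one of them is significant. Real forking paths are usually correlated, so this is an idealized sketch, not a model of any particular study.</p>

```python
import random


def analytic_false_positive_rate(num_tests, alpha=0.05):
    """Chance that at least one of num_tests independent null tests is significant."""
    return 1 - (1 - alpha) ** num_tests


def simulated_false_positive_rate(num_tests, alpha=0.05, num_studies=20000, seed=0):
    """Simulate studies with no true effect; p-values are uniform under the null."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(num_studies):
        p_values = [rng.random() for _ in range(num_tests)]
        if min(p_values) < alpha:
            false_positives += 1
    return false_positives / num_studies


for num_tests in [1, 5, 10, 20]:
    print(num_tests,
          analytic_false_positive_rate(num_tests),
          simulated_false_positive_rate(num_tests))
```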
<p>We finished up class by talking about a few tools for computational reproducibility, specifically <a href="https://rmarkdown.rstudio.com">RMarkdown</a> for reproducible documents and <a href="https://bost.ocks.org/mike/make/">Makefiles</a> for efficient workflows. Example files are up
<a href="https://github.com/jhofman/msd2019/tree/master/lectures/lecture_6">on Github</a>.</p>
<p>References:</p>
<ul>
<li>A guide on <a href="https://transparentstats.github.io/guidelines/effectsize.html">effect sizes</a> and related <a href="https://transparentstatistics.org/2018/07/05/meanings-effect-size/">blog post</a></li>
<li><a href="https://rpsychologist.com/d3/cohend/">Interpreting Cohen’s d effect size</a></li>
<li><a href="https://journals.sagepub.com/doi/pdf/10.1177/0956797613504966">The New Statistics: Why and How</a> by Cumming</li>
<li><a href="https://www.jstor.org/stable/3802789?seq=1#metadata_info_tab_contents">The Insignificance of Significance Testing</a> by Johnson</li>
<li><a href="https://journals.sagepub.com/doi/abs/10.1177/106591299905200309">The Insignificance of Null Hypothesis Significance Testing</a> by Gill</li>
<li><a href="http://journals.plos.org/plosmedicine/article/file?id=10.1371/journal.pmed.0020124&type=printable">Why Most Published Research Findings Are False</a></li>
<li>Felix Schönbrodt’s <a href="http://www.nicebread.de/whats-the-probability-that-a-significant-p-value-indicates-a-true-effect/">blog post</a> and
<a href="http://shinyapps.org/apps/PPV/">shiny app</a> on misconceptions about p-values and false discoveries</li>
<li><a href="http://www.cyclismo.org/tutorial/R/power.html">Calculating the power of a test</a></li>
<li><a href="http://www.nature.com/nrn/journal/v14/n5/pdf/nrn3475.pdf">Power failure: why small sample size undermines the reliability of neuroscience</a> by Button et al.</li>
<li><a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=1850704">False-Positive Psychology</a> by Simmons, Nelson & Simonsohn</li>
<li><a href="http://www.stat.columbia.edu/~gelman/research/unpublished/p_hacking.pdf">The garden of forking paths</a> by Gelman & Loken</li>
<li><a href="https://www.cambridge.org/core/journals/psychological-medicine/article/cumulative-effect-of-reporting-and-citation-biases-on-the-apparent-efficacy-of-treatments-the-case-of-depression/71D73CADE32C0D3D996DABEA3FCDBF57/core-reader">The cumulative effect of reporting and citation biases on the apparent efficacy of treatments</a> by de Vries et al. (<a href="https://www.nytimes.com/2018/09/24/upshot/publication-bias-threat-to-science.html?em_pos=small&emc=edit_up_20180924&nl=upshot&nl_art=0&nlid=57978065emc%3Dedit_up_20180924&ref=headline&te=1">popular coverage</a>)</li>
<li>Pre-registration portals from the <a href="https://osf.io/registries/">Open Science Framework</a>, <a href="https://cos.io/prereg/">Center for Open Science</a>, and <a href="https://aspredicted.org/index.php">AsPredicted.org</a></li>
<li>Science magazine’s announcement of <a href="http://www.sciencemag.org/careers/2015/12/register-your-study-new-publication-option">registered reports</a></li>
<li><a href="https://bost.ocks.org/mike/make/">Why Use Make</a> by Mike Bostock</li>
<li><a href="http://zmjones.com/make/">GNU Make for Reproducible Data Analysis</a></li>
<li><a href="https://www.rstudio.com/wp-content/uploads/2015/02/rmarkdown-cheatsheet.pdf">RMarkdown cheatsheet</a></li>
<li><a href="https://rmarkdown.rstudio.com/">RStudio’s RMarkdown site</a></li>
<li>The <a href="https://bookdown.org/yihui/rmarkdown/">RMarkdown: The Definitive Guide</a> book</li>
</ul>
<!-- measures of effect size rosenthal https://books.google.com/books?hl=en&lr=&id=p-aFAwAAQBAJ&oi=fnd&pg=PA231&dq=parametric+measure+of+effect+size+rosenthal&ots=TVzKQfiJTJ&sig=JwandSbd84lwhv0BeK0O9FX8k70#v=onepage&q&f=false -->
Fri, 01 Mar 2019 10:00:00 +0000Homework 2
http://modelingsocialdata.org/homework/2019/02/28/homework-2.html
http://modelingsocialdata.org/homework/2019/02/28/homework-2.html<p>The second homework assignment, <a href="https://github.com/jhofman/msd2019/tree/master/homework/homework_2">posted on Github</a>, is due on Thursday, March 14 by 11:59pm ET.</p>
<p>The first problem looks at the link between coffee and cancer, the second problem examines an experiment on whether yawning is contagious, and the third problem involves replicating the results of a paper about the Google ngram dataset. Details are in the README.md file for each problem.</p>
<p>Your code and a brief report with your results are to be submitted electronically in one zipped (or tarball-ed) file through the <a href="https://courseworks2.columbia.edu/courses/77738">CourseWorks</a> site. All solutions should be contained in the corresponding files provided here. Code should be written in bash / R and should not have complex dependencies on non-standard libraries. Each problem contains an Rmarkdown file which should be rendered as a pdf to be submitted with your solution. All work should be your own and done individually.</p>
Thu, 28 Feb 2019 17:00:00 +0000Lecture 5: Reproducibility and replication, Part 1
http://modelingsocialdata.org/lectures/2019/02/22/lecture-5-reproducibility-1.html
http://modelingsocialdata.org/lectures/2019/02/22/lecture-5-reproducibility-1.html<p>We discussed the ongoing <a href="https://en.wikipedia.org/wiki/Replication_crisis">replication crisis</a> in the sciences, wherein it has proven difficult or impossible for researchers to independently verify results of previously published studies.</p>
<script async="" class="speakerdeck-embed" data-id="8c1dd50c57e14f26b3a9c8fbc9837376" data-ratio="1.33333333333333" src="//speakerdeck.com/assets/embed.js"></script>
<p>We started off the lecture by talking about how to evaluate research findings. Namely, how can you assess whether the results of a study are believable and/or important?</p>
<p>We took the optimistic view that most researchers are honest, although there are <a href="https://en.wikipedia.org/wiki/List_of_scientific_misconduct_incidents">some exceptions</a>. For instance, a recent study by <a href="http://science.sciencemag.org/content/346/6215/1366.full">LaCour and Green</a> reported that a single conversation with canvassers had lasting impact on support for gay marriage. But soon after the study was published, Broockman, Kalla, and Aronow found <a href="http://stanford.edu/~dbroock/broockman_kalla_aronow_lg_irregularities.pdf">some irregularities</a> in the data. The paper was later <a href="http://www.sciencemag.org/news/2015/05/science-retracts-gay-marriage-paper-without-agreement-lead-author-lacour">retracted</a> on account of the data being fabricated using the results of a previous study. Broockman and Kalla then proceeded to carry out <a href="http://science.sciencemag.org/content/352/6282/220">their own version</a> of such a study, and ironically found <a href="https://www.wired.com/2016/04/political-sciences-whistleblowers-rebunk-gay-canvassing-study/">support for the original hypothesis</a>.</p>
<p>While such instances of fraud are rare, there are other, more common concerns among published studies. The first is <em>reproducibility</em>, or whether one can independently verify the results of a study with the same data and same code used in the original paper. Though a low bar, most research currently doesn’t pass this test simply because papers are often published without all of the supporting data or code. And when the data and code are available, the code can be surprisingly difficult to understand or run, especially when there are complex software dependencies. This is improving as researchers adopt better software engineering practices and develop <a href="http://science.sciencemag.org/content/354/6317/1240.full">guidelines</a>, <a href="http://www.rctatman.com/files/2018-7-14-MLReproducability.pdf">best practices</a>, and <a href="https://medium.com/@michel.steuwer/artifact-review-and-badging-855dc11b64a0">incentives</a> for reproducibility.</p>
<p>Next we discussed <em>replicability</em>, or whether a result holds when a study is repeated with new data but the same analysis as the original paper. The main issue here is that it’s easy to be fooled by randomness because noise can dominate signal in small datasets and asking too many questions of the data can lead to overfitting, even with large datasets. We looked at a seminal paper from the <a href="https://osf.io/vmrgu/">Open Science Collaboration</a>, <a href="http://science.sciencemag.org/content/349/6251/aac4716.full">Estimating the reproducibility of psychological science</a>, which conducts replications of 100 published psychology studies and finds that roughly a third replicate, often with smaller effect sizes than reported in the original studies.</p>
<p>This led us to a review of frequentist statistics, which, although somewhat of a <a href="https://www.mpib-berlin.mpg.de/pubdata/gigerenzer/Gigerenzer_2018_Statistical_rituals.pdf">statistical ritual</a>, is still an <a href="https://amstat.tandfonline.com/doi/abs/10.1080/00031305.2016.1154108#.XE8wl89KjRY">important one to understand</a>, for better or worse. A short quiz on the topic highlighted that it’s easy for newcomers and trained professionals alike to <a href="https://link.springer.com/article/10.1007%2Fs10654-016-0149-3">misunderstand</a> the meaning of p-values, hypothesis tests, and statistical significance. We reviewed null hypothesis testing through the lens of simulation, in contrast to the usual textbook approach of learning a battery of parametric tests.</p>
<p>At a high level, null hypothesis testing asks “how (un)likely are the data I observed under a certain (null) model of the world?” If the data are sufficiently unlikely, we can reject this null model; otherwise our test is inconclusive. The catch is that we have to quantify what constitutes “sufficiently unlikely” and we have to make sure our experiment is actually powerful enough to reject the null when it’s false. In the Neyman-Pearson framework, we make choices based on the long-run error rates we’re willing to tolerate if this procedure is repeated over and over again. While this is usually taught using a reasonable amount of fancy math, we instead discussed it using brute force simulation, which allowed us to focus on the concepts instead of formulas and recipes. The basic idea is simple: if we’d like to know what to expect if the null model is actually true, we can just simulate many such experiments assuming it’s true, look at the distribution of outcomes, and compare what we actually see in the world to the results of our simulations. More details are in this notebook on
<a href="http://htmlpreview.github.io/?https://github.com/jhofman/msd2019/blob/master/lectures/lecture_5/statistical_inference.html">simulation-based statistical inference</a> and the <a href="https://github.com/jhofman/msd2019-notes/tree/master/lecture_5">scribed notes</a>.</p>
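<p>The simulation idea can be sketched in a few lines of Python. The scenario is hypothetical and is not taken from the course notebook: we observe 60 heads in 100 coin flips and ask how often a fair coin would produce a result at least that far from 50.</p>

```python
import random


def simulate_heads(num_flips, prob_heads, rng):
    """Run one simulated experiment and return the number of heads."""
    return sum(1 for _ in range(num_flips) if rng.random() < prob_heads)


def simulated_p_value(observed_heads, num_flips, num_simulations=10000, seed=0):
    """Two-sided p-value: share of fair-coin experiments at least as extreme
    as the observed result."""
    rng = random.Random(seed)
    expected = num_flips / 2
    observed_gap = abs(observed_heads - expected)
    at_least_as_extreme = 0
    for _ in range(num_simulations):
        heads = simulate_heads(num_flips, 0.5, rng)
        if abs(heads - expected) >= observed_gap:
            at_least_as_extreme += 1
    return at_least_as_extreme / num_simulations


# Hypothetical observation: 60 heads in 100 flips
p_value = simulated_p_value(observed_heads=60, num_flips=100)
print("simulated p-value:", p_value)
```

<p>Running the same simulation with a biased coin in place of the fair one, and counting how often the null is rejected, gives a simulation-based estimate of power.</p>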
<p>We’ll continue this discussion of statistics, reproducibility, replication, and evaluating research next week.</p>
<p>References:</p>
<ul>
<li><a href="http://science.sciencemag.org/content/354/6317/1240.full">Enhancing reproducibility for computational methods</a> by Stodden et al.</li>
<li><a href="http://www.rctatman.com/files/2018-7-14-MLReproducability.pdf">A Practical Taxonomy of Reproducibility for Machine Learning Research</a> by Tatman, VanderPlas & Dane</li>
<li>A post on <a href="https://medium.com/@michel.steuwer/artifact-review-and-badging-855dc11b64a0">ACM’s Artifact Review and Badging</a></li>
<li><a href="http://science.sciencemag.org/content/349/6251/aac4716.full">Estimating the reproducibility of psychological science</a> from the Open Science Collaboration</li>
<li><a href="https://www.mpib-berlin.mpg.de/pubdata/gigerenzer/Gigerenzer_2018_Statistical_rituals.pdf">Statistical Rituals: The Replication Delusion and How We Got There</a> by Gigerenzer</li>
<li>The American Statistical Association’s <a href="https://amstat.tandfonline.com/doi/abs/10.1080/00031305.2016.1154108#.XE8wl89KjRY">statement on p-values</a> by Wasserstein & Lazar</li>
<li><a href="https://link.springer.com/article/10.1007%2Fs10654-016-0149-3">Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations</a> by Greenland et al.</li>
<li><a href="https://seeing-theory.brown.edu">Seeing Theory</a>, a visual, simulation-based tour of statistics</li>
<li>Chapters 12 and 13 of <a href="http://pluto.huji.ac.il/%7Emsby/StatThink/index.html">Introduction to Statistical Thinking (With R, Without Calculus)</a></li>
<li><a href="https://www.openintro.org/stat/textbook.php">Introductory Statistics with Randomization and Simulation</a></li>
<li>Statistics for Hackers by VanderPlas (<a href="https://speakerdeck.com/jakevdp/statistics-for-hackers">slides</a>, <a href="https://www.youtube.com/watch?v=Iq9DzN6mvYA">video</a>)</li>
</ul>
Fri, 22 Feb 2019 10:10:00 +0000Lecture 4: Data Visualization
http://modelingsocialdata.org/lectures/2019/02/15/lecture-4-data-visualization.html
http://modelingsocialdata.org/lectures/2019/02/15/lecture-4-data-visualization.html<p>We used this lecture to discuss <a href="https://github.com/jhofman/msd2019/blob/master/lectures/lecture_3/intro_to_r.ipynb">data manipulation</a> and <a href="https://github.com/jhofman/msd2019/blob/master/lectures/lecture_4/visualization_with_ggplot2.ipynb">data visualization</a> in R, specifically focusing on <a href="https://dplyr.tidyverse.org"><code class="highlighter-rouge">dplyr</code></a> and <a href="https://ggplot2.tidyverse.org"><code class="highlighter-rouge">ggplot2</code></a> from the <a href="http://tidyverse.org"><code class="highlighter-rouge">tidyverse</code></a>.</p>
<script async="" class="speakerdeck-embed" data-id="4540923077774710a34ba80dfc9c4dd5" data-ratio="1.33333333333333" src="//speakerdeck.com/assets/embed.js"></script>
<p>The <code class="highlighter-rouge">tidyverse</code> relies on data being in a “tidy” format of one observation per row, one variable per column, and one value per cell. It provides tools for getting untidy data (of which there’s lots) into a tidy format. Once data are in this format, it provides tools for chaining together a string of commands, similar to unix pipes, that make it very easy to translate the ideas and questions in your mind into working and readable code. This allows you to spend more time exploring and understanding your data and less time debugging code.</p>
<script async="" class="speakerdeck-embed" data-id="5bf041357fc24ff5b9cef83713baed0e" data-ratio="1.33333333333333" src="//speakerdeck.com/assets/embed.js"></script>
<p>We discussed visualization as a way to better understand data and as a way of communicating with readers. We briefly reviewed experiments by <a href="http://www.jstor.org/stable/2288400?seq=1#page_scan_tab_contents">Cleveland and McGill</a> showing that not all visual encodings are created equal, <a href="http://dl.acm.org/citation.cfm?id=22950">Mackinlay’s</a> expressiveness / effectiveness tradeoff, and <a href="https://en.wikipedia.org/wiki/Leland_Wilkinson">Wilkinson’s</a> grammar of graphics. We spent a good amount of time discussing how every visualization should convey a point, preferably one that can be summarized by a short sentence. These data visualization slides are generously adapted from <a href="http://hci.stanford.edu/~cagatay/">Çağatay Demiralp</a>.</p>
<p>Source code for the examples we reviewed is available on the course Github page: <a href="https://github.com/jhofman/msd2019/tree/master/lectures/lecture_3">data manipulation</a>, <a href="https://github.com/jhofman/msd2019/tree/master/lectures/lecture_4">data visualization</a>.</p>
<p>There are <a href="https://pinboard.in/u:jhofman/t:r/t:tutorials/">lots of R resources</a> available on the web, but here are a few highlights:</p>
<ul>
<li><a href="http://tryr.codeschool.com">CodeSchool</a> and <a href="https://www.datacamp.com/courses/free-introduction-to-r">DataCamp</a> intro to R courses</li>
<li>More about <a href="http://www.r-tutor.com/r-introduction/basic-data-types">basic types</a> (numeric, character, logical, factor) in R</li>
<li>Vectors, lists, dataframes: a <a href="http://www.statmethods.net/input/datatypes.html">one page reference</a> and [more details]</li>
<li>Chapters <a href="http://r4ds.had.co.nz/introduction.html">1</a>, <a href="http://r4ds.had.co.nz/explore-intro.html">2</a>, and <a href="http://r4ds.had.co.nz/transform.html">5</a> of <a href="http://r4ds.had.co.nz">R for Data Science</a></li>
<li>DataCamp’s <a href="https://campus.datacamp.com/courses/dplyr-data-manipulation-r-tutorial">Data Manipulation in R</a> tutorial</li>
<li>The <a href="https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html">dplyr vignette</a></li>
<li>Sean Anderson’s <a href="http://seananderson.ca/2014/09/13/dplyr-intro.html">dplyr and pipes examples</a> (<a href="https://github.com/seananderson/dplyr-intro-2014">code</a> on github)</li>
<li>Rstudio’s <a href="http://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf">data wrangling cheatsheet</a></li>
<li>Hadley Wickham’s <a href="http://bit.ly/splitapplycombine">split/apply/combine</a> paper</li>
<li>The <a href="https://style.tidyverse.org">tidyverse style guide</a></li>
<li>Chapters <a href="http://r4ds.had.co.nz/data-visualisation.html">3</a>, <a href="http://r4ds.had.co.nz/exploratory-data-analysis.html">7</a>, and <a href="http://r4ds.had.co.nz/graphics-for-communication.html">28</a> in <a href="http://r4ds.had.co.nz/">R for Data Science</a></li>
<li>DataCamp’s <a href="https://campus.datacamp.com/courses/data-visualization-with-ggplot2-1/">Data Visualization with ggplot2</a> tutorial</li>
<li>Videos on <a href="http://varianceexplained.org/RData/lessons/lesson2/">Visualizing Data with ggplot2</a></li>
<li>Sean Anderson’s <a href="http://seananderson.ca/courses/12-ggplot2/ggplot2_slides_with_examples.pdf">ggplot2 slides</a> (<a href="http://github.com/seananderson/datawranglR">code</a>) for more examples</li>
<li>RStudio’s <a href="https://www.rstudio.com/resources/cheatsheets/">cheatsheets</a></li>
</ul>
Fri, 15 Feb 2019 10:10:00 +0000Lecture 3: Computational complexity
http://modelingsocialdata.org/lectures/2019/02/08/lecture-3-computational-complexity.html
http://modelingsocialdata.org/lectures/2019/02/08/lecture-3-computational-complexity.html<p>We had a guest lecture from <a href="http://sidsen.org/">Sid Sen</a> on computational complexity and algorithm analysis.</p>
<p><img src="http://modelingsocialdata.org/img/runtime_table.png" alt="Algorithm runtime in seconds, from Kleinberg & Tardos" /></p>
<p>Sid discussed various ways of analyzing how long algorithms take to run, focusing on worst-case analysis.
We discussed <a href="https://www.khanacademy.org/computing/computer-science/algorithms/asymptotic-notation/a/asymptotic-notation">asymptotic notation</a> (<a href="https://www.khanacademy.org/computing/computer-science/algorithms/asymptotic-notation/a/big-o-notation">big-O</a> for upper bounds, <a href="https://www.khanacademy.org/computing/computer-science/algorithms/asymptotic-notation/a/big-big-omega-notation">big-omega</a> for lower bounds, and <a href="https://www.khanacademy.org/computing/computer-science/algorithms/asymptotic-notation/a/big-big-theta-notation">big-theta</a> for tight bounds).
The table above, from <a href="https://www.pearsonhighered.com/program/Kleinberg-Algorithm-Design/PGM319216.html">Algorithm Design</a> by Kleinberg and Tardos, shows how long we should expect different algorithms to run on modern hardware.
The key takeaway is that knowing how to match the right algorithm to your dataset is important.
For instance, when you’re dealing with millions of observations, only linear (or maybe <a href="https://en.wikipedia.org/wiki/Time_complexity#Linearithmic_time">linearithmic</a>) time algorithms are practical.</p>
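<p>To make the point about matching algorithms to dataset size concrete, here is a rough back-of-the-envelope sketch. It assumes a machine that performs one million simple operations per second; that rate, along with the names <code class="highlighter-rouge">OPS_PER_SECOND</code>, <code class="highlighter-rouge">GROWTH_RATES</code>, and <code class="highlighter-rouge">estimated_seconds</code>, is an illustrative assumption and not something taken from the lecture or the table.</p>

```python
import math

# Assumed machine speed: one million simple operations per second.
# This rate is an illustrative assumption, not a measurement.
OPS_PER_SECOND = 1_000_000

# Number of steps each growth rate implies for an input of size n.
GROWTH_RATES = {
    "n": lambda n: n,
    "n log2 n": lambda n: n * math.log2(n),
    "n^2": lambda n: n ** 2,
    "n^3": lambda n: n ** 3,
}


def estimated_seconds(n, ops_per_second=OPS_PER_SECOND):
    """Map each growth rate to its estimated runtime in seconds for input size n."""
    return {name: steps(n) / ops_per_second for name, steps in GROWTH_RATES.items()}


if __name__ == "__main__":
    for n in (1_000, 1_000_000):
        print(f"n = {n:,}")
        for name, seconds in estimated_seconds(n).items():
            print(f"  {name:>9}: {seconds:,.2f} seconds")
```

<p>Under this assumed rate, a million observations take about a second for a linear algorithm and about 20 seconds for a linearithmic one, while a quadratic algorithm needs roughly a million seconds, which is more than 11 days.</p>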
<p>A few other references:</p>
<ul>
<li>A <a href="https://rob-bell.net/2009/06/a-beginners-guide-to-big-o-notation/">beginner’s guide</a> to big-O notation</li>
<li>Another <a href="https://www.interviewcake.com/article/python/big-o-notation-time-and-space-complexity">introduction to big-O</a></li>
<li>The <a href="http://bigocheatsheet.com/">big-O cheatsheet</a></li>
</ul>
<p>We touched upon a few more advanced topics around the tradeoff between how long something takes to run and how much space it requires. Sid gave a brief overview of <a href="https://brilliant.org/wiki/skip-lists/">skip lists</a> and mentioned some more recent work by his advisor, Robert Tarjan, on <a href="https://arxiv.org/abs/1806.06726v2">zip trees</a> (video lecture <a href="https://www.youtube.com/watch?v=NxRXhBur6Xs">here</a>).</p>
<p>Sid finished his lecture by discussing how this applies to something as simple as taking the intersection of two lists, useful for <a href="https://en.wikipedia.org/wiki/Join_(SQL)">joining</a> different tables.
A naive approach of comparing all pairs of elements takes quadratic time.
It’s relatively easy to do much better by <a href="https://en.wikipedia.org/wiki/Sort-merge_join">sorting and merging</a> the two lists, reducing this to <code class="highlighter-rouge">n log(n)</code> time.
And if we’re willing to trade space for time, we can use a <a href="https://en.wikipedia.org/wiki/Hash_table">hash table</a> to get the job done in linear time, known as a <a href="https://en.wikipedia.org/wiki/Hash_join">hash join</a>.</p>
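<p>The three approaches to list intersection described above can be sketched as follows. This is a minimal illustration and not the code from the lecture; the function names are hypothetical, and each function returns every common value once, even if the inputs contain duplicates.</p>

```python
def intersect_naive(a, b):
    """Compare all pairs of elements: quadratic time."""
    common = []
    for x in a:
        for y in b:
            if x == y:
                if x not in common:  # report each common value once
                    common.append(x)
                break
    return common


def intersect_sort_merge(a, b):
    """Sort both lists, then walk them in step: n log(n) time, dominated by the sorts."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    i = j = 0
    common = []
    while i < len(a_sorted) and j < len(b_sorted):
        if a_sorted[i] < b_sorted[j]:
            i += 1
        elif a_sorted[i] > b_sorted[j]:
            j += 1
        else:
            # Match: skip it if it repeats the value we just recorded.
            if not common or common[-1] != a_sorted[i]:
                common.append(a_sorted[i])
            i += 1
            j += 1
    return common


def intersect_hash(a, b):
    """Build a hash table from one list and probe it with the other.

    Expected linear time, at the cost of extra memory for the table.
    Python's built-in set is used as the hash table here.
    """
    table = set(a)
    emitted = set()
    common = []
    for y in b:
        if y in table and y not in emitted:
            emitted.add(y)
            common.append(y)
    return common


if __name__ == "__main__":
    left = [5, 3, 9, 1, 3]
    right = [3, 7, 5, 5, 2]
    for intersect in (intersect_naive, intersect_sort_merge, intersect_hash):
        print(intersect.__name__, sorted(intersect(left, right)))
```

<p>All three return the same common values, though not necessarily in the same order: the naive version follows the first list, the sort-merge version returns sorted output, and the hash version follows the second list.</p>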
<p>We used the end of lecture to revisit the command line and finish up a few leftover topics. See <a href="/lectures/2019/02/01/lecture-2-counting.html">last week’s post</a> for links to code from class.</p>
<p>Next week we’ll discuss data manipulation in R. In preparation, make sure to <a href="/homework/2019/01/24/installing-tools.html">set up</a> R and the <a href="https://www.tidyverse.org">tidyverse</a> packages. If you’re new to R, in addition to the readings in the R4DS book, check out the <a href="http://tryr.codeschool.com">CodeSchool</a> and <a href="https://www.datacamp.com/courses/free-introduction-to-r">DataCamp</a> intro to R courses. Also have a look at the slides and code we’ll discuss in class next week, which are <a href="https://github.com/jhofman/msd2019/tree/master/lectures/lecture_3">up on github</a>.</p>
Fri, 08 Feb 2019 00:00:00 +0000Homework 1
http://modelingsocialdata.org/homework/2019/02/07/homework-1.html
http://modelingsocialdata.org/homework/2019/02/07/homework-1.html<p>The first homework assignment, <a href="https://github.com/jhofman/msd2019/tree/master/homework/homework_1">posted on Github</a>, is due on Thursday, February 21 by 11:59pm ET.</p>
<p>The first problem explores various counting techniques, the second involves some command line and R counting exercises, and the third looks at the impact of inventory size on customer satisfaction for the MovieLens data.
Details are in the README.md file for each problem.</p>
<p>Your code and a brief report with your results are to be submitted electronically in one zipped (or tarball-ed) file through the <a href="https://courseworks2.columbia.edu/courses/77738">CourseWorks</a> site.
All code should be contained in plain text files and should produce the exact results you provide in your writeup.
Code should be written in bash / R and should not have complex dependencies on non-standard libraries.
The report should simply present your answers to the questions in an organized format as either a plain text or pdf file.
All work should be your own and done individually.</p>
Thu, 07 Feb 2019 17:00:00 +0000