http://modelingsocialdata.org/
Mon, 15 Apr 2019 14:54:17 +0000
Homework 4
http://modelingsocialdata.org/homework/2019/04/10/homework-4.html
<p>The fourth homework assignment, <a href="https://github.com/jhofman/msd2019/tree/master/homework/homework_4">posted on Github</a>, is due on Thursday, April 25 by 11:59pm ET.</p>
<p>The first problem explores the small-world phenomenon in “close” vs. “distant” friend networks, the second studies how the structure of an email network changes as we remove weak ties from it, and the third looks at gender assortativity in networks. Details are in the README.md file for each problem.</p>
<p>Your code and results are to be submitted electronically in one zipped (or tarball-ed) file through the <a href="https://courseworks2.columbia.edu/courses/77738">CourseWorks</a> site. All solutions should be contained in the corresponding files provided here. Code should be written in bash / R and should not have complex dependencies on non-standard libraries. Each problem contains an Rmarkdown file which should be rendered as a pdf to be submitted with your solution. All work should be your own and done individually.</p>
Wed, 10 Apr 2019 10:00:00 +0000
Lecture 10: Networks
http://modelingsocialdata.org/lectures/2019/04/05/lecture-10-networks.html
<p>We used this lecture to first go through applications of logistic regression and then to discuss the history of network science.</p>
<!-- We spent this lecture discussing network data, including a whirlwind tour of the history of network theory, representations and characteristics of networks, and algorithms for analyzing network data. -->
<center>
<script async="" class="speakerdeck-embed" data-id="7848c1385ff346709bae389edb62613d" data-ratio="1.33333333333333" src="//speakerdeck.com/assets/embed.js"></script>
</center>
<p>We started off this lecture by revisiting logistic regression, looking at the problem of modeling which passengers <a href="https://www.kaggle.com/c/titanic">survived the Titanic disaster</a>. We saw that interpreting logistic regression results can be challenging, as coefficients give information about changes in log-odds (as opposed to probabilities directly). We stressed the idea of converting back to probabilities and visually comparing predicted and actual values for a range of feature values to better understand the model fit. See <a href="http://htmlpreview.github.io/?https://github.com/jhofman/msd2019/blob/master/lectures/lecture_10/interpreting_logistic_regression.html">this notebook</a> for details.</p>
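<p>The point about log-odds can be made concrete with a small sketch. The following Python snippet (the course itself uses R; the coefficients below are made up for illustration, not taken from the Titanic fit) shows how a coefficient shifts the log-odds, and why the implied change in probability depends on the baseline:</p>

```python
import math

def sigmoid(z):
    # logistic function: maps log-odds to a probability
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical fitted model: log-odds = b0 + b1 * (1 if female else 0)
b0, b1 = -1.1, 2.5

p_male = sigmoid(b0)         # predicted survival probability, male
p_female = sigmoid(b0 + b1)  # predicted survival probability, female

# b1 is the *change in log-odds*, not the change in probability:
# the same b1 implies a different probability shift at a different b0.
print(round(p_male, 3), round(p_female, 3))
```

Converting back to the probability scale like this, and plotting predicted against actual values, is usually far easier to reason about than the raw coefficients.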
<p>Next we discussed <a href="https://github.com/VowpalWabbit/vowpal_wabbit/wiki">Vowpal Wabbit</a> (VW), an open source tool for various machine learning tasks. VW has many attractive features, such as a flexible input format, speed, scalability, and sensible defaults. For binary classification, VW defaults to fitting a (clipped) linear model to minimize squared loss. We looked at an example of <a href="https://github.com/VowpalWabbit/vowpal_wabbit/wiki/Rcv1-example">classifying news</a> with VW to get a sense of the interface and performance, which is quite competitive.</p>
<p>Then we moved on to a history of network science.</p>
<p>We talked about some of the earliest studies of networks, such as Jacob Moreno’s <a href="https://timesmachine.nytimes.com/timesmachine/1933/04/03/99218765.html?action=click&contentCollection=Archives&module=LedeAsset&region=ArchiveBody&pgtype=article&pageNumber=17">sociograms</a> and Mark Granovetter’s work on the <a href="https://sociology.stanford.edu/sites/default/files/publications/the_strength_of_weak_ties_and_exch_w-gans.pdf">strength of weak ties</a>. We contrasted theoretical models of graphs (e.g., <a href="http://en.wikipedia.org/wiki/Erdős–Rényi_model">Erdős–Rényi</a> random graphs) to real-world networks, which tend to have highly <a href="http://en.wikipedia.org/wiki/Complex_network#Scale-free_networks">skewed degree distributions</a> as originally discussed in Derek de Solla Price’s studies of <a href="http://garfield.library.upenn.edu/papers/pricenetworks1965.pdf">citation networks</a>. At the same time, social networks typically have <a href="http://en.wikipedia.org/wiki/Small-world_network">short path lengths</a>, in the sense that one needs only to traverse a handful of links to connect a randomly selected set of people in the network.</p>
<p>We finished by discussing different types of networks that we might analyze as well as the various levels of abstraction available for representing them.</p>
<p>More on networks next time.</p>
<!--
, we turned to algorithms for efficiently computing shortest path lengths, connected components, mutual friends, and clustering coefficients.
We started with the problem of finding the shortest distance between a single source node and all other nodes in a (undirected, unweighted) network, as measured by the fewest number of edges you need to traverse to get from the source to every other node.
(Every researcher's favorite version of this is computing their [Erdős number](http://en.wikipedia.org/wiki/Erdős_number), the academic take on the more well-known [Kevin Bacon game](http://en.wikipedia.org/wiki/Six_Degrees_of_Kevin_Bacon). Compute yours [here](http://academic.research.microsoft.com).)
Breadth first search (BFS) provides a nice solution.
The intuition behind BFS is simple: we start from the source node and mark it as distance zero from itself.
Then we visit each of its neighbors and mark those as distance one.
We repeat this iteratively, pushing forward a boundary of recently discovered nodes that are one additional hop from the source at each step.
BFS visits each node and edge in a network once, scaling linearly in the size of the network.
If, however, we would like to find the shortest distance between _all pairs_ of nodes then we must repeat this for each possible source node, and so this quickly becomes prohibitively expensive for even moderately sized networks.
(See [here](http://en.wikipedia.org/wiki/Shortest_path_problem#All-pairs_shortest_paths) for fancier, more efficient algorithms.)
Next we looked at using BFS for a related problem: finding the number of [connected components](http://en.wikipedia.org/wiki/Connected_component_(graph_theory)), or separate pieces, of a network.
We did this by simply looping over our shortest path code, seeding it on each iteration with a currently unreachable node as the source until we reach all nodes.
We gave the reachable nodes in each BFS a unique label corresponding to its component.
Then we moved on to computing the number of friends that any two nodes have in common, motivated by the problem of friend recommendations on social networks.
The underlying idea can be traced back to [Granovetter](https://sociology.stanford.edu/sites/default/files/publications/the_strength_of_weak_ties_and_exch_w-gans.pdf): two people are likely to know each other if they have many mutual friends.
To compute the number of mutual friends between all pairs of nodes, we exploit the fact that the neighbors of every node share that node as a common friend.
To count all mutual friends we simply loop over each node and increment a counter for every pair of its neighbors.
For each node this scales as the square of its degree, so the whole algorithm scales as the sum of the squared degrees of all nodes.
This can quickly become expensive if we have even a few high-degree nodes, which are quite common in practice.
Finally, we looked at the closely related problem of counting the number of triangles around each node in a network.
This algorithm is nearly identical to computing mutual friends, as we generate the same set of two-hop paths through all pairs of a node's neighbors, but simply increment different counters to generate different results.
Instead of accumulating mutual friends for each pair of a node's neighbors, we ask whether every pair of neighbors are themselves directly connected.
If so, we count this as (half of) a triangle in which the node participates.
Dividing the number of closed triangles in a network by the number of possible triangles that could be present gives a useful measure of how [clustered](http://en.wikipedia.org/wiki/Clustering_coefficient) a network is.
To better understand properties of networks and how to compute them, we looked at a few example networks in R using the ``igraph`` package.
See the [notebooks](https://github.com/jhofman/msd2017/tree/master/lectures/lecture_10) on the course GitHub page for related code and data used in the lectures.
-->
<p>References:</p>
<ul>
<li>The <a href="https://github.com/VowpalWabbit/vowpal_wabbit/wiki">Vowpal Wabbit Wiki</a></li>
<li>Chapters 2, 18, and 20 of Easley and Kleinberg’s <a href="http://www.cs.cornell.edu/home/kleinber/networks-book/">Networks, Crowds, and Markets</a></li>
<li>Granovetter’s <a href="https://sociology.stanford.edu/sites/default/files/publications/the_strength_of_weak_ties_and_exch_w-gans.pdf">Strength of Weak Ties</a> paper</li>
<li>de Solla Price on <a href="http://garfield.library.upenn.edu/papers/pricenetworks1965.pdf">citation networks</a> and <a href="http://garfield.library.upenn.edu/price/pricetheory1976.pdf">cumulative advantage</a></li>
<li>Milgram’s original <a href="https://en.wikipedia.org/wiki/Small-world_experiment">small world experiment</a></li>
<li><a href="https://www.math.cornell.edu/m/sites/default/files/imported/People/strogatz/nature_smallworld.pdf">Collective dynamics of ‘small-world’ networks</a> by Watts & Strogatz</li>
</ul>
<!--
* [Four degrees of separation](http://web.stanford.edu/~jugander/papers/websci12-fourdegrees.pdf): scaling up calculations to the entire Facebook social graph
* [Customizable route planning](http://www.rebennack.net/SEA2011/files/talks/SEA2011_Pajor.pdf): how shortest path calculations are done in modern mapping applications
* These [slides](https://berkeleydatascience.files.wordpress.com/2012/03/20120320berkeley.pdf) on the early system for friend recommendation on Facebook (pages 28 to 37)
-->
<!--
BFS computes shortest path: http://www.cs.toronto.edu/~krueger/cscB63h/lectures/BFS.pdf
BFS runtime and correctness: http://www.cse.ust.hk/faculty/golin/COMP271Sp03/Notes/MyL06.ps
[MapReduce for networks](http://jakehofman.com/icwsm2010/slides.html)
https://github.com/jhofman/icwsm2010_tutorial
[Curse of the last reducer](http://theory.stanford.edu/~sergei/papers/www11-triangles.pdf)
[Model of MapReduce](http://theory.stanford.edu/~sergei/papers/soda10-mrc.pdf)
[Facebook at scale](http://arxiv.org/abs/1111.4503)
-->
Fri, 05 Apr 2019 10:00:00 +0000
Lecture 9: Classification
http://modelingsocialdata.org/lectures/2019/03/29/lecture-9-classification.html
<p>In this lecture we covered classification with linear models, specifically naive Bayes and logistic regression.</p>
<center>
<script async="" class="speakerdeck-embed" data-id="46903fe715de4ab59c254c6a61ea866d" data-ratio="1.33333333333333" src="//speakerdeck.com/assets/embed.js"></script>
</center>
<p>We started this lecture by introducing the problem of classification and how it differs from regression: the outcome is categorical (e.g., whether an email is spam or <a href="https://wiki.apache.org/spamassassin/Ham">ham</a>) rather than continuous.
We first reviewed <a href="http://en.wikipedia.org/wiki/Bayes'_rule">Bayes’ rule</a> for inverting conditional probabilities via a simple, but <a href="http://bit.ly/ggbbc">somewhat counterintuitive</a>, <a href="http://www.scientificamerican.com/article/what-is-bayess-theorem-an/">medical diagnosis example</a> and then adapted this to an (extremely naive) <a href="https://github.com/jhofman/msd2019/blob/master/lectures/lecture_9/enron_naive_bayes.sh">one-word spam classifier</a>.
We improved upon this by considering all words present in a document and arrived at naive Bayes—a simple linear method for classification in which we model each word occurrence independently and use Bayes’ rule to calculate the probability the document belongs to each class given the words it contains.</p>
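<p>To make the mechanics concrete, here is a minimal Python sketch of a naive Bayes classifier (the course works in bash/R, and the tiny corpus below is invented for illustration; it also quietly adds a pseudocount, discussed next, to avoid zero probabilities):</p>

```python
import math
from collections import Counter

# Tiny hypothetical corpus: (words, label) pairs
docs = [
    ("send cash now".split(), "spam"),
    ("cheap cash offer".split(), "spam"),
    ("meeting agenda attached".split(), "ham"),
    ("lunch meeting tomorrow".split(), "ham"),
]

# Count word occurrences per class and class frequencies
word_counts = {"spam": Counter(), "ham": Counter()}
class_counts = Counter()
for words, label in docs:
    class_counts[label] += 1
    word_counts[label].update(words)

def log_posterior(words, label):
    # log P(label) + sum of log P(word | label), treating each
    # word independently -- the "naive" assumption
    logp = math.log(class_counts[label] / sum(class_counts.values()))
    total = sum(word_counts[label].values())
    for w in words:
        logp += math.log((word_counts[label][w] + 1) / (total + 2))  # +1 pseudocount
    return logp

def classify(text):
    words = text.split()
    return max(("spam", "ham"), key=lambda c: log_posterior(words, c))

print(classify("cash offer"))  # classified as spam
```

Working in log space avoids numerical underflow when multiplying many small word probabilities.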
<p>Although naive Bayes makes an obviously incorrect assumption that all features are independent, it turns out to be a reasonably useful method in practice.
It’s simple and scalable to train, easy to update as new data arrive, easy to interpret, and often more competitive in performance than one might expect.
That said, there are some obvious issues with naive Bayes as presented, namely overfitting in the training process and overconfidence / miscalibration when making predictions.</p>
<p>The first issue arises when thinking about how to estimate word probabilities.
Simple maximum likelihood estimates (MLE) for word probabilities lead to overfitting, implying, for instance, that it’s impossible to see a word in a given class in the future if we’ve never seen it occur in that class in the past.
We dealt with this by thinking about maximum a posteriori (MAP) estimation which led to the idea of <a href="https://en.wikipedia.org/wiki/Additive_smoothing">Laplace smoothing</a>, or adding <a href="http://en.wikipedia.org/wiki/Pseudocount">pseudocounts</a> to empirical word counts to prevent overfitting.
As usual, determining the amount of smoothing to use is an empirical question, often solved by methods such as cross-validation.</p>
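<p>The zero-count problem and its fix are easy to see numerically. A sketch with invented counts (not from any real corpus):</p>

```python
# MLE vs. Laplace-smoothed estimates of P(word | class), for a
# hypothetical class where "refund" was seen 3 times out of 10 word
# occurrences in training and "viagra" was never seen.
counts = {"refund": 3, "viagra": 0}
total = 10
vocab_size = 2  # number of modeled words (tiny, for illustration)

mle = {w: c / total for w, c in counts.items()}
alpha = 1  # pseudocount; in practice tuned, e.g. by cross-validation
smoothed = {w: (c + alpha) / (total + alpha * vocab_size)
            for w, c in counts.items()}

print(mle["viagra"])       # 0.0 -- MLE says this word can never occur
print(smoothed["viagra"])  # small but nonzero
```

A single unseen word under the MLE would zero out the entire class posterior, which is exactly the overfitting described above.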
<p>As for the second problem of feature independence, we addressed this by abandoning naive Bayes in favor of logistic regression.
Logistic regression makes predictions using the same functional form as naive Bayes—the log-odds are modeled as a weighted combination of feature values—but fits these weights in a manner that accounts for correlations between features.
We (once again) applied the maximum likelihood principle to arrive at criteria for estimating these weights, and discussed gradient descent for a solution. The resulting algorithms are very close in spirit to those for linear regression, but slightly more complex due to the logistic function.
And, similar to linear regression, we discussed the idea of regularizing logistic regression by including a term in the loss function to penalize large weight vectors.</p>
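<p>The whole fitting loop fits in a few lines. A Python sketch of regularized logistic regression by gradient descent on synthetic one-feature data (the data, learning rate, and penalty strength are all made up for illustration):</p>

```python
import math, random

# Toy data: true model is P(y=1|x) = sigmoid(2x)
random.seed(0)
xs = [random.uniform(-3, 3) for _ in range(200)]
ys = [1 if random.random() < 1 / (1 + math.exp(-2 * x)) else 0 for x in xs]

w, lam, lr = 0.0, 0.01, 0.1
for _ in range(500):
    # gradient of the average negative log-likelihood...
    grad = sum((1 / (1 + math.exp(-w * x)) - y) * x
               for x, y in zip(xs, ys)) / len(xs)
    grad += lam * w  # ...plus the gradient of the L2 penalty
    w -= lr * grad

print(round(w, 2))  # should land near the true coefficient of 2
```

Unlike least squares, there is no closed-form solution here, which is why iterative methods like gradient descent are the standard approach.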
<p>We concluded with a discussion of several metrics for evaluating classifiers, including calibration, <a href="https://en.wikipedia.org/wiki/Confusion_matrix">confusion matrices</a>, accuracy, precision and recall, and the <a href="https://en.wikipedia.org/wiki/Receiver_operating_characteristic">ROC curve</a>. See the <a href="https://github.com/jhofman/msd2019/blob/master/lectures/lecture_9/classification.ipynb">classification notebook</a> up on Github for more details.</p>
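<p>The confusion-matrix metrics reduce to simple ratios. A sketch with made-up counts:</p>

```python
# Counts from a hypothetical confusion matrix:
#                 predicted +   predicted -
#   actual +         tp            fn
#   actual -         fp            tn
tp, fp, fn, tn = 40, 10, 20, 30

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)  # of those predicted positive, how many are?
recall = tp / (tp + fn)     # of the actual positives, how many were found?

print(accuracy, precision, recall)
```

Accuracy alone can be misleading under class imbalance, which is why precision and recall (and the full ROC curve) are worth examining.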
<p>A few references:</p>
<ul>
<li>Chapter 12 of <a href="http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/">Advanced Data Analysis from an Elementary Point of View</a></li>
<li>Chapter 4 of <a href="http://www-bcf.usc.edu/~gareth/ISL/getbook.html">An Introduction to Statistical Learning</a></li>
<li><a href="http://www.cs.iastate.edu/~honavar/bayes-lewis.pdf">Naive Bayes at 40</a> by Lewis (1998)</li>
<li><a href="http://www.jstor.org/pss/1403452">Idiots Bayes—Not So Stupid After All?</a> by Hand and Yu (2001)</li>
<li><a href="http://robotics.stanford.edu/users/sahami/papers-dir/spam.pdf">A Bayesian Approach to Filtering Junk E-mail</a> from Sahami, Dumais, Heckerman, and Horvitz (1998)</li>
<li><a href="http://www.paulgraham.com/spam.html">A Plan for Spam</a> by Paul Graham (2002)</li>
<li><a href="https://ccrma.stanford.edu/workshops/mir2009/references/ROCintro.pdf">An introduction to ROC analysis</a></li>
<li><a href="http://www.navan.name/roc/">Understanding ROC curves</a></li>
<li><a href="https://github.com/VowpalWabbit/vowpal_wabbit/wiki">Vowpal Wabbit</a> for scalable classification</li>
</ul>
Fri, 29 Mar 2019 10:00:00 +0000
Homework 3
http://modelingsocialdata.org/homework/2019/03/29/homework-3.html
<p>The third homework assignment, <a href="https://github.com/jhofman/msd2019/tree/master/homework/homework_3">posted on Github</a>, is due on Thursday, April 11 by 11:59pm ET.</p>
<p>The first problem explores various modeling scenarios, the second looks at cross-validation for polynomial regression, and in the third you’ll use regularized logistic regression to classify New York Times articles. Details are in the README.md file for each problem.</p>
<p>Your code and results are to be submitted electronically in one zipped (or tarball-ed) file through the <a href="https://courseworks2.columbia.edu/courses/77738">CourseWorks</a> site. All solutions should be contained in the corresponding files provided here. Code should be written in bash / R and should not have complex dependencies on non-standard libraries. Each problem contains an Rmarkdown file which should be rendered as a pdf to be submitted with your solution. All work should be your own and done individually.</p>
Fri, 29 Mar 2019 10:00:00 +0000
Lecture 8: Regression, Part 2
http://modelingsocialdata.org/lectures/2019/03/15/lecture-8-regression-2.html
<p>This was the second lecture on the theory and practice of regression, focused on model complexity and generalization.</p>
<center>
<script async="" class="speakerdeck-embed" data-id="8b16d5652bae434e8d478f70bcce6724" data-ratio="1.33333333333333" src="//speakerdeck.com/assets/embed.js"></script>
</center>
<p>We started with an applied modeling problem: understanding how internet browsing activity varies by age and gender. We saw that there’s a lot more to modeling than just optimization, with many important steps along the way that range from collecting and specifying outcomes and predictors, to determining the form of a model, to assessing performance and interpreting results. We found that including quadratic terms for age and interacting gender with age gave a reasonable model, at least in terms of visually matching empirical aggregates. See the <a href="https://github.com/jhofman/msd2019/blob/master/lectures/lecture_7/linear_models.ipynb">linear models</a> notebook up on Github for more details.</p>
<p>Then we talked about two high-level points.
First, quantifying model fit and second, knowing when to stop fitting.
In this case, that translates to asking “how good is a quadratic fit” and “why shouldn’t I use a cubic, or quartic, etc.?” or “should I add another interaction?”</p>
<p>To the first point, we discussed <a href="https://en.wikipedia.org/wiki/Root-mean-square_deviation">root mean squared error (RMSE)</a> and the <a href="https://en.wikipedia.org/wiki/Coefficient_of_determination">coefficient of determination (<script type="math/tex">R^2</script>)</a> as sensible metrics of model fit.
RMSE is just the squared loss function we discussed last time, with a square root to adjust units to match those of the outcome we’re trying to predict.
It’s useful when we already have a sense of absolute scale for “what’s good”.
The coefficient of determination, on the other hand, captures the fraction of variance in outcomes explained by the model, and is useful when we don’t have such a scale or are comparing across different problems.
We showed that this is the same as comparing the mean squared error (MSE) of the model to the MSE of a simple baseline where we always predict the average outcome.
Finally, we discussed the connection between <a href="https://en.wikipedia.org/wiki/Pearson_correlation_coefficient">Pearson’s correlation coefficient</a> and <script type="math/tex">R^2</script>.
See <a href="https://economictheoryblog.com/2014/11/05/proof/">here</a> for a proof that the latter is in fact the square of the former. See the <a href="https://github.com/jhofman/msd2019/blob/master/lectures/lecture_8/model_evaluation.ipynb">model evaluation</a> notebook on Github for details.</p>
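<p>The MSE-versus-baseline view of <script type="math/tex">R^2</script> can be computed directly. A Python sketch with invented outcomes and predictions:</p>

```python
# R^2 as a comparison of the model's MSE to that of a baseline
# that always predicts the mean outcome.
ys = [1.0, 2.0, 3.0, 4.0, 5.0]
preds = [1.2, 1.9, 3.1, 3.8, 5.0]  # hypothetical model predictions

mean_y = sum(ys) / len(ys)
mse_model = sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)
mse_baseline = sum((y - mean_y) ** 2 for y in ys) / len(ys)

r_squared = 1 - mse_model / mse_baseline
rmse = mse_model ** 0.5  # same units as the outcome
print(round(r_squared, 3))
```

An <script type="math/tex">R^2</script> of 1 means the model explains all the variance; 0 means it does no better than always predicting the mean.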
<p>Applying both of these metrics to the pageview dataset, we saw that while there were systematic trends in typical viewing behavior by age and gender, there was still a surprisingly large amount of variation in individual activity for people of the same age and gender.</p>
<p>This led us to our second high-level topic, the question of complexity control: How complicated should we make our model?
We discussed the idea of generalization error, and how we’d like models that are both complex enough to account for the past and simple enough to predict the future.
Cross-validation is the most common approach to navigating this tradeoff, where we divide our data into a training set for fitting models, a validation set for comparing these different fits, and a test set that’s used once (and <em>only once</em>) to quote the expected future performance of the model we end up selecting.
We talked about <a href="https://www.youtube.com/watch?v=TIgfjmp-4BA">k-fold cross-validation</a> as a more statistically robust version of estimating generalization error. See the <a href="https://github.com/jhofman/msd2019/blob/master/lectures/lecture_8/complexity_Control.ipynb">complexity control</a> notebook on Github for details.</p>
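<p>The bookkeeping behind k-fold cross-validation is straightforward. A bare-bones Python sketch (the splitting rule here is a simple round-robin for illustration; in practice one would shuffle first):</p>

```python
# Each point appears in the validation fold exactly once; averaging
# the k validation scores estimates generalization error.
def kfold_indices(n, k):
    folds = []
    for i in range(k):
        val = [j for j in range(n) if j % k == i]
        train = [j for j in range(n) if j % k != i]
        folds.append((train, val))
    return folds

for train, val in kfold_indices(10, 5):
    pass  # fit on `train`, score on `val`, then average the k scores

print(len(kfold_indices(10, 5)))  # 5 folds
```

The test set, by contrast, stays untouched until the very end.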
<p>We also phrased this issue in terms of the <a href="http://scott.fortmann-roe.com/docs/BiasVariance.html">bias-variance tradeoff</a>.
Simple models are likely biased in that they systematically misrepresent the world, and would do so even with an infinite amount of data.
At the same time, estimating a simple model is a low variance procedure in that our results don’t change substantially when we fit it on different samples of data.
More flexible models, on the other hand, have little bias and can capture more complex patterns in the world.
The downside is that this flexibility also renders such models sensitive to noise, often leading to high variance, or drastically different results with different samples of the data.</p>
<p>We concluded lecture with a brief discussion of <a href="https://en.wikipedia.org/wiki/Regularization_%28mathematics%29">regularization</a> as a way of modifying loss functions to improve the generalization error of our models by explicitly balancing the fit to the training data with the “complexity” of the model.
The idea is that introducing some bias in our models is sometimes a good idea if the corresponding reduction in variance is enough to lower the mean squared error.</p>
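<p>In its simplest (ridge-style) form, the modified loss is just squared error plus a penalty on the size of the weights. A Python sketch with made-up data:</p>

```python
# Penalized loss: mean squared error plus lambda * w^2.
# Larger lambda shrinks the coefficient toward zero, trading a
# little bias for (hopefully) a bigger reduction in variance.
def ridge_loss(w, xs, ys, lam):
    sq_err = sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)
    return sq_err + lam * w ** 2

xs, ys = [1.0, 2.0, 3.0], [2.1, 3.9, 6.2]
# The penalty raises the loss at the unpenalized optimum (w near 2),
# so the penalized minimizer sits closer to zero.
print(ridge_loss(2.0, xs, ys, 0.0) < ridge_loss(2.0, xs, ys, 1.0))
```

Replacing <code>w ** 2</code> with <code>abs(w)</code> gives the lasso penalty, which additionally drives some coefficients exactly to zero.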
<p>See Github for an <a href="https://github.com/jhofman/msd2019/blob/master/lectures/lecture_8/intro_to_glmnet.ipynb">introduction to glmnet</a> as well as this interactive Shiny App to explore <a href="https://jmhmsr.shinyapps.io/regularization/">regularization</a>.</p>
<p>References:</p>
<ul>
<li>Chapter 2 of <a href="http://www-bcf.usc.edu/~gareth/ISL/">An Introduction to Statistical Learning</a> on the bias-variance tradeoff</li>
<li>Section 1.4 of <a href="http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/">Advanced Data Analysis from an Elementary Point of View</a> on the same, with a more detailed derivation
<!-- http://www.inf.ed.ac.uk/teaching/courses/mlsc/Notes/Lecture4/BiasVariance.pdf --></li>
<li>Chapter 5 of <a href="http://www-bcf.usc.edu/~gareth/ISL/">An Introduction to Statistical Learning</a> and 3 of <a href="http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/">Advanced Data Analysis from an Elementary Point of View</a> on resampling and cross-validation</li>
<li>Recent work on using differentially private mechanisms for <a href="https://research.googleblog.com/2015/08/the-reusable-holdout-preserving.html">reusing holdout sets</a></li>
<li>Chapters <a href="http://r4ds.had.co.nz/model-basics.html">23</a> and <a href="http://r4ds.had.co.nz/model-building.html">24</a> of <a href="http://r4ds.had.co.nz">R for Data Science</a></li>
<li>The <a href="https://modelr.tidyverse.org">modelr</a> and <a href="https://github.com/tidymodels/tidymodels">tidymodels</a> packages in R</li>
<li>The <a href="https://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html">glmnet vignette</a></li>
</ul>
Fri, 15 Mar 2019 10:00:00 +0000
Lecture 7: Regression, Part 1
http://modelingsocialdata.org/lectures/2019/03/08/lecture-7-regression-1.html
<p>This was the first of two lectures on the theory and practice of regression.</p>
<center>
<script async="" class="speakerdeck-embed" data-id="199594cffb524787a7bced446593789a" data-ratio="1.33333333333333" src="//speakerdeck.com/assets/embed.js"></script>
</center>
<p>In the first part of class we shifted from talking about problems in how science is often done to best practices for doing good science. We went through the pipeline of designing a study, piloting and revising it, doing a power calculation, pre-registering the study, running it, creating a reproducible analysis and report, and thinking critically about the results.</p>
<p>Next we moved on to regression.
We started with a high-level overview of regression, which can be broadly defined as any analysis of how one continuous variable (the “outcome”) changes with others (the “inputs”, “predictors”, or “features”).
The goals of a regression analysis can vary, from describing the data at hand, to predicting new outcomes, to explaining the associations between outcomes and predictors.
This includes everything from looking at histograms and scatter plots to building statistical models.</p>
<p>We focused on the latter and discussed ordinary least squares regression.
First, we motivated this as an optimization problem and then connected squared loss minimization to the more general principle of maximum likelihood.
Then we discussed several ways to solve this optimization problem to estimate coefficients for a linear model, which are summarized in the table below.</p>
<table>
<thead>
<tr>
<th>Method</th>
<th style="text-align: center">Space</th>
<th style="text-align: center">Time</th>
<th>Comments</th>
</tr>
</thead>
<tbody>
<tr>
<td>Invert normal equations</td>
<td style="text-align: center"><script type="math/tex">N K + K^2</script></td>
<td style="text-align: center"><script type="math/tex">K^3</script></td>
<td>Good for medium-sized datasets with a relatively small number (e.g., hundreds or thousands) of features</td>
</tr>
<tr>
<td>Gradient descent</td>
<td style="text-align: center"><script type="math/tex">N K</script></td>
<td style="text-align: center"><script type="math/tex">NK</script> per step</td>
<td>Good for larger datasets that still fit in memory but have more (e.g., millions) features; requires tuning learning rate</td>
</tr>
<tr>
<td>Stochastic gradient descent</td>
<td style="text-align: center"><script type="math/tex">K</script></td>
<td style="text-align: center"><script type="math/tex">K</script> per step</td>
<td>Good for datasets that exceed available memory; more sensitive to learning rate schedule</td>
</tr>
</tbody>
</table>
<p>See also this interactive Shiny App to explore <a href="https://jmhmsr.shinyapps.io/modelfit/">manually fitting a simple model</a> and this notebook by Jongbin Jung with <a href="http://jakehofman.com/gd/">an animation of gradient descent</a>.</p>
<!--
In the second half of class we looked at fitting linear models in R, with an application to understanding how internet browsing activity varies by age and gender.
See the [Jupyter notebook](https://github.com/jhofman/msd2017/blob/master/lectures/lecture_6/linear_models.ipynb) up on Github for more details.
The main lesson here is that there's more to modeling than just optimization, with many important steps along the way that range from collecting and specifying outcomes and predictors, to determining the form of a model, to assessing performance and interpreting results.
-->
<p>References:</p>
<ul>
<li>Chapter 3 of <a href="http://www-bcf.usc.edu/~gareth/ISL/">An Introduction to Statistical Learning</a></li>
<li>Chapters 1 and 2 of <a href="http://www.stat.cmu.edu/%7Ecshalizi/ADAfaEPoV/">Advanced Data Analysis from an Elementary Point of View</a></li>
<li>Chapter 5 of OpenIntro’s <a href="https://www.openintro.org/stat/textbook.php">Introductory Statistics with Randomization and Simulation</a></li>
<li><a href="http://www.cambridge.org/us/academic/subjects/statistics-probability/statistical-theory-and-methods/statistical-models-theory-and-practice-2nd-edition?format=PB">Statistical Models</a> by David Freedman</li>
<li><a href="https://us.sagepub.com/en-us/nam/regression-analysis/book226138">Regression Analysis</a> by Richard Berk</li>
<li>Chapters <a href="http://r4ds.had.co.nz/model-basics.html">23</a> and <a href="http://r4ds.had.co.nz/model-building.html">24</a> of <a href="http://r4ds.had.co.nz">R for Data Science</a></li>
</ul>
Fri, 08 Mar 2019 10:00:00 +0000
Lecture 6: Reproducibility and replication, Part 2
http://modelingsocialdata.org/lectures/2019/03/01/lecture-6-reproducibility-2.html
<p>This was our second lecture on reproducibility and replication in which we discussed false discoveries, effect sizes, and p-hacking / researcher degrees of freedom.</p>
<center>
<script async="" class="speakerdeck-embed" data-id="ce73cc7b18114447b75619411419bd76" data-ratio="1.33333333333333" src="//speakerdeck.com/assets/embed.js"></script>
</center>
<p>The <a href="/lectures/2019/02/22/lecture-5-reproducibility-1.html">previous lecture</a> provided a high-level overview of the ongoing replication crisis in the sciences. In this lecture we continued the discussion, first by talking about false discoveries. Following Felix Schönbrodt’s excellent <a href="http://www.nicebread.de/whats-the-probability-that-a-significant-p-value-indicates-a-true-effect/">blog post</a>, we talked about how underpowered studies lead to false discoveries. Then we went on to discuss <a href="https://transparentstats.github.io/guidelines/effectsize.html">effect sizes</a>, specifically <a href="https://en.wikipedia.org/wiki/Effect_size#Cohen's_d">Cohen’s d</a> and the <a href="https://en.wikipedia.org/wiki/Effect_size#Common_language_effect_size">AUC</a>, through this excellent <a href="https://rpsychologist.com/d3/cohend/">visual tool</a>.</p>
<p>Next we spoke about <a href="https://en.wikipedia.org/wiki/Post_hoc_analysis">post-hoc data analysis</a> and <a href="https://en.wikipedia.org/wiki/Data_dredging">p-hacking</a>. We looked at the <a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=1850704">False-Positive Psychology</a> paper by Simmons, Nelson & Simonsohn, which has an illustrative example of how one can arrive at nonsensical conclusions if there’s enough flexibility in data collection and analysis. Gelman and Loken’s <a href="http://www.stat.columbia.edu/~gelman/research/unpublished/p_hacking.pdf">The Garden of Forking Paths</a> makes a similar point, noting that this can often occur without ill intent on the part of the researcher. While these issues are complex, there are a few best practices (e.g., running pilot studies followed by <a href="https://aspredicted.org">pre-registration</a> of high-powered, large-scale experiments) that can help mitigate these concerns.
<a href="http://www.sciencemag.org/careers/2015/12/register-your-study-new-publication-option">Registered reports</a> are a particularly attractive solution, wherein researchers write up and submit an experimental study for peer review <em>before</em> the study is conducted. Reviewers make an acceptance decision at this point based on the merit of the study, and, if accepted, it is published regardless of the results. We also discussed how these ideas that come largely from randomized experiments might be adapted for observational studies.</p>
<p>We finished up class by talking about a few tools for computational reproducibility, specifically <a href="https://rmarkdown.rstudio.com">RMarkdown</a> for reproducible documents and <a href="https://bost.ocks.org/mike/make/">Makefiles</a> for efficient workflows. Example files are up
<a href="https://github.com/jhofman/msd2019/tree/master/lectures/lecture_6">on Github</a>.</p>
<p>References:</p>
<ul>
<li>A guide on <a href="https://transparentstats.github.io/guidelines/effectsize.html">effect sizes</a> and related <a href="https://transparentstatistics.org/2018/07/05/meanings-effect-size/">blog post</a></li>
<li><a href="https://rpsychologist.com/d3/cohend/">Interpreting Cohen’s d effect size</a></li>
<li><a href="https://journals.sagepub.com/doi/pdf/10.1177/0956797613504966">The New Statistics: Why and How</a> by Cumming</li>
<li><a href="https://www.jstor.org/stable/3802789?seq=1#metadata_info_tab_contents">The Insignificance of Significance Testing</a> by Johnson</li>
<li><a href="https://journals.sagepub.com/doi/abs/10.1177/106591299905200309">The Insignificance of Null Hypothesis Significance Testing</a> by Gill</li>
<li><a href="http://journals.plos.org/plosmedicine/article/file?id=10.1371/journal.pmed.0020124&type=printable">Why Most Published Research Findings Are False</a></li>
<li>Felix Schönbrodt’s <a href="http://www.nicebread.de/whats-the-probability-that-a-significant-p-value-indicates-a-true-effect/">blog post</a> and
<a href="http://shinyapps.org/apps/PPV/">shiny app</a> on misconceptions about p-values and false discoveries</li>
<li><a href="http://www.cyclismo.org/tutorial/R/power.html">Calculating the power of a test</a></li>
<li><a href="http://www.nature.com/nrn/journal/v14/n5/pdf/nrn3475.pdf">Power failure: why small sample size undermines the reliability of neuroscience</a> by Button et al.</li>
<li><a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=1850704">False-Positive Psychology</a> by Simmons, Nelson & Simonsohn</li>
<li><a href="http://www.stat.columbia.edu/~gelman/research/unpublished/p_hacking.pdf">The garden of forking paths</a> by Gelman & Loken</li>
<li><a href="https://www.cambridge.org/core/journals/psychological-medicine/article/cumulative-effect-of-reporting-and-citation-biases-on-the-apparent-efficacy-of-treatments-the-case-of-depression/71D73CADE32C0D3D996DABEA3FCDBF57/core-reader">The cumulative effect of reporting and citation biases on the apparent efficacy of treatments</a> by de Vries et al. (<a href="https://www.nytimes.com/2018/09/24/upshot/publication-bias-threat-to-science.html?em_pos=small&emc=edit_up_20180924&nl=upshot&nl_art=0&nlid=57978065emc%3Dedit_up_20180924&ref=headline&te=1">popular coverage</a>)</li>
<li>Pre-registration portals from the <a href="https://osf.io/registries/">Open Science Framework</a>, <a href="https://cos.io/prereg/">Center for Open Science</a>, and <a href="https://aspredicted.org/index.php">AsPredicted.org</a></li>
<li>Science magazine’s announcement of <a href="http://www.sciencemag.org/careers/2015/12/register-your-study-new-publication-option">registered reports</a></li>
<li><a href="https://bost.ocks.org/mike/make/">Why Use Make</a> by Mike Bostock</li>
<li><a href="http://zmjones.com/make/">GNU Make for Reproducible Data Analysis</a></li>
<li><a href="https://www.rstudio.com/wp-content/uploads/2015/02/rmarkdown-cheatsheet.pdf">RMarkdown cheatsheet</a></li>
<li><a href="https://rmarkdown.rstudio.com/">RStudio’s RMarkdown site</a></li>
<li>The book <a href="https://bookdown.org/yihui/rmarkdown/">R Markdown: The Definitive Guide</a></li>
</ul>
Fri, 01 Mar 2019 10:00:00 +0000Homework 2
http://modelingsocialdata.org/homework/2019/02/28/homework-2.html
http://modelingsocialdata.org/homework/2019/02/28/homework-2.html<p>The second homework assignment, <a href="https://github.com/jhofman/msd2019/tree/master/homework/homework_2">posted on Github</a>, is due on Thursday, March 14 by 11:59pm ET.</p>
<p>The first problem looks at the link between coffee and cancer, the second problem examines an experiment on whether yawning is contagious, and the third problem involves replicating the results of a paper about the Google ngram dataset. Details are in the README.md file for each problem.</p>
<p>Your code and a brief report with your results are to be submitted electronically in one zipped (or tarball-ed) file through the <a href="https://courseworks2.columbia.edu/courses/77738">CourseWorks</a> site. All solutions should be contained in the corresponding files provided here. Code should be written in bash / R and should not have complex dependencies on non-standard libraries. Each problem contains an Rmarkdown file which should be rendered as a pdf to be submitted with your solution. All work should be your own and done individually.</p>
Thu, 28 Feb 2019 17:00:00 +0000Lecture 5: Reproducibility and replication, Part 1
http://modelingsocialdata.org/lectures/2019/02/22/lecture-5-reproducibility-1.html
http://modelingsocialdata.org/lectures/2019/02/22/lecture-5-reproducibility-1.html<p>We discussed the ongoing <a href="https://en.wikipedia.org/wiki/Replication_crisis">replication crisis</a> in the sciences, wherein it has proven difficult or impossible for researchers to independently verify results of previously published studies.</p>
<script async="" class="speakerdeck-embed" data-id="8c1dd50c57e14f26b3a9c8fbc9837376" data-ratio="1.33333333333333" src="//speakerdeck.com/assets/embed.js"></script>
<p>We started off the lecture by talking about how to evaluate research findings. Namely, how can you assess whether the results of a study are believable and/or important?</p>
<p>We took the optimistic view that most researchers are honest, although there are <a href="https://en.wikipedia.org/wiki/List_of_scientific_misconduct_incidents">some exceptions</a>. For instance, a recent study by <a href="http://science.sciencemag.org/content/346/6215/1366.full">LaCour and Green</a> reported that a single conversation with canvassers had lasting impact on support for gay marriage. But soon after the study was published, Broockman, Kalla, and Aronow found <a href="http://stanford.edu/~dbroock/broockman_kalla_aronow_lg_irregularities.pdf">some irregularities</a> in the data. The paper was later <a href="http://www.sciencemag.org/news/2015/05/science-retracts-gay-marriage-paper-without-agreement-lead-author-lacour">retracted</a> on account of the data being fabricated using the results of a previous study. Broockman and Kalla then proceeded to carry out <a href="http://science.sciencemag.org/content/352/6282/220">their own version</a> of such a study, and ironically found <a href="https://www.wired.com/2016/04/political-sciences-whistleblowers-rebunk-gay-canvassing-study/">support for the original hypothesis</a>.</p>
<p>While such instances of fraud are rare, there are other, more common concerns among published studies. The first is <em>reproducibility</em>, or whether one can independently verify the results of a study with the same data and same code used in the original paper. Though a low bar, most research currently doesn’t pass this test simply because it’s often the case that papers are published without all of the supporting data or code. And when the data and code are available, the code can be surprisingly difficult to understand or run, especially when there are complex software dependencies. This is improving as researchers adopt better software engineering practices and develop <a href="http://science.sciencemag.org/content/354/6317/1240.full">guidelines</a>, <a href="http://www.rctatman.com/files/2018-7-14-MLReproducability.pdf">best practices</a>, and <a href="https://medium.com/@michel.steuwer/artifact-review-and-badging-855dc11b64a0">incentives</a> for reproducibility.</p>
<p>Next we discussed <em>replicability</em>, or whether a result holds when a study is repeated with new data but the same analysis as the original paper. The main issue here is that it’s easy to be fooled by randomness: noise can dominate signal in small datasets, and asking too many questions of the data can lead to overfitting, even with large datasets. We looked at a seminal paper from the <a href="https://osf.io/vmrgu/">Open Science Collaboration</a>, <a href="http://science.sciencemag.org/content/349/6251/aac4716.full">Estimating the reproducibility of psychological science</a>, which conducts replications of 100 published psychology studies and finds that roughly a third replicate, often with smaller effect sizes than reported in the original studies.</p>
<p>This led us to a review of frequentist statistics, which, although somewhat of a <a href="https://www.mpib-berlin.mpg.de/pubdata/gigerenzer/Gigerenzer_2018_Statistical_rituals.pdf">statistical ritual</a>, is still <a href="https://amstat.tandfonline.com/doi/abs/10.1080/00031305.2016.1154108#.XE8wl89KjRY">important to understand</a>, for better or worse. A short quiz on the topic highlighted that it’s easy for newcomers and trained professionals alike to <a href="https://link.springer.com/article/10.1007%2Fs10654-016-0149-3">misunderstand</a> the meaning of p-values, hypothesis tests, and statistical significance. We reviewed null hypothesis testing through the lens of simulation, in contrast to the usual textbook approach of learning a battery of parametric tests.</p>
<p>At a high level, null hypothesis testing asks “how (un)likely are the data I observed under a certain (null) model of the world?” If the data are sufficiently unlikely, we can reject this null model; otherwise our test is inconclusive. The catch is that we have to quantify what constitutes “sufficiently unlikely” and we have to make sure our experiment is actually powerful enough to reject the null when it’s false. In the Neyman-Pearson framework, we make choices based on the long-run error rates we’re willing to tolerate if this procedure is repeated over and over again. While this is usually taught using a reasonable amount of fancy math, we instead discussed it using brute-force simulation, which allowed us to focus on the concepts instead of formulas and recipes. The basic idea is simple: if we’d like to know what to expect if the null model is actually true, we can just simulate many such experiments assuming it’s true, look at the distribution of outcomes, and compare what we actually see in the world to the results of our simulations. More details are in this notebook on
<a href="http://htmlpreview.github.io/?https://github.com/jhofman/msd2019/blob/master/lectures/lecture_5/statistical_inference.html">simulation-based statistical inference</a> and the <a href="https://github.com/jhofman/msd2019-notes/tree/master/lecture_5">scribed notes</a>.</p>
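To make the recipe concrete, here is a toy version in R (a made-up example in the spirit of the notebook, not its actual code): suppose we observe 60 heads in 100 coin flips and ask how surprising that would be under the null model of a fair coin.

```r
set.seed(42)
num_flips      <- 100
observed_heads <- 60

# simulate 100,000 experiments assuming the null (fair coin) is true
null_heads <- rbinom(1e5, size = num_flips, prob = 0.5)

# one-sided p-value: how often does the null look at least as extreme as the data?
p_value <- mean(null_heads >= observed_heads)
p_value  # ~0.03, close to the exact answer from binom.test(60, 100, alternative = "greater")
```

No parametric formula is needed: the p-value is just a count over simulated worlds where the null holds.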
<p>We’ll continue this discussion of statistics, reproducibility, replication, and evaluating research next week.</p>
<p>References:</p>
<ul>
<li><a href="http://science.sciencemag.org/content/354/6317/1240.full">Enhancing reproducibility for computational methods</a> by Stodden et al.</li>
<li><a href="http://www.rctatman.com/files/2018-7-14-MLReproducability.pdf">A Practical Taxonomy of Reproducibility for Machine Learning Research</a> by Tatman, VanderPlas & Dane</li>
<li>A post on <a href="https://medium.com/@michel.steuwer/artifact-review-and-badging-855dc11b64a0">ACM’s Artifact Review and Badging</a></li>
<li><a href="http://science.sciencemag.org/content/349/6251/aac4716.full">Estimating the reproducibility of psychological science</a> from the Open Science Collaboration</li>
<li><a href="https://www.mpib-berlin.mpg.de/pubdata/gigerenzer/Gigerenzer_2018_Statistical_rituals.pdf">Statistical Rituals: The Replication Delusion and How We Got There</a> by Gigerenzer</li>
<li>The American Statistical Association’s <a href="https://amstat.tandfonline.com/doi/abs/10.1080/00031305.2016.1154108#.XE8wl89KjRY">statement on p-values</a> by Wasserstein & Lazar</li>
<li><a href="https://link.springer.com/article/10.1007%2Fs10654-016-0149-3">Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations</a> by Greenland et al.</li>
<li><a href="https://seeing-theory.brown.edu">Seeing Theory</a>, a visual, simulation-based tour of statistics</li>
<li>Chapters 12 and 13 of <a href="http://pluto.huji.ac.il/%7Emsby/StatThink/index.html">Introduction to Statistical Thinking (With R, Without Calculus)</a></li>
<li><a href="https://www.openintro.org/stat/textbook.php">Introductory Statistics with Randomization and Simulation</a></li>
<li>Statistics for Hackers by VanderPlas (<a href="https://speakerdeck.com/jakevdp/statistics-for-hackers">slides</a>, <a href="https://www.youtube.com/watch?v=Iq9DzN6mvYA">video</a>)</li>
</ul>
Fri, 22 Feb 2019 10:10:00 +0000Lecture 4: Data Visualization
http://modelingsocialdata.org/lectures/2019/02/15/lecture-4-data-visualization.html
http://modelingsocialdata.org/lectures/2019/02/15/lecture-4-data-visualization.html<p>We used this lecture to discuss <a href="https://github.com/jhofman/msd2019/blob/master/lectures/lecture_3/intro_to_r.ipynb">data manipulation</a> and <a href="https://github.com/jhofman/msd2019/blob/master/lectures/lecture_4/visualization_with_ggplot2.ipynb">data visualization </a> in R, specifically focusing on <a href="https://dplyr.tidyverse.org"><code class="highlighter-rouge">dplyr</code></a> and <a href="https://ggplot2.tidyverse.org"><code class="highlighter-rouge">ggplot2</code></a> from the <a href="http://tidyverse.org"><code class="highlighter-rouge">tidyverse</code></a>.</p>
<script async="" class="speakerdeck-embed" data-id="4540923077774710a34ba80dfc9c4dd5" data-ratio="1.33333333333333" src="//speakerdeck.com/assets/embed.js"></script>
<p>The <code class="highlighter-rouge">tidyverse</code> relies on data being in a “tidy” format of one observation per row, one variable per column, and one value per cell. It provides tools for getting untidy data (of which there’s lots) into a tidy format. Once data are in this format, it provides tools for chaining together a string of commands, similar to Unix pipes, which makes it very easy to translate the ideas and questions in your mind into working and readable code. This allows you to spend more time exploring and understanding your data and less time debugging code.</p>
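As a minimal sketch (using R's built-in mtcars data rather than one of the course datasets), a typical dplyr chain reads almost like the question it answers:

```r
library(dplyr)

# for each number of cylinders, how many cars are there and what is
# their average fuel efficiency, from most to least efficient?
mtcars %>%
  group_by(cyl) %>%
  summarize(num_cars = n(), avg_mpg = mean(mpg)) %>%
  arrange(desc(avg_mpg))
```

Each verb does one thing, and the pipe passes its result along, so the code can be read top to bottom like a sentence.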
<script async="" class="speakerdeck-embed" data-id="5bf041357fc24ff5b9cef83713baed0e" data-ratio="1.33333333333333" src="//speakerdeck.com/assets/embed.js"></script>
<p>We discussed visualization as a way to better understand data and as a way of communicating with readers. We briefly reviewed experiments by <a href="http://www.jstor.org/stable/2288400?seq=1#page_scan_tab_contents">Cleveland and McGill</a> showing that not all visual encodings are created equal, <a href="http://dl.acm.org/citation.cfm?id=22950">Mackinlay’s</a> expressiveness / effectiveness tradeoff, and <a href="https://en.wikipedia.org/wiki/Leland_Wilkinson">Wilkinson’s</a> grammar of graphics. We spent a good amount of time discussing how every visualization should convey a point, preferably one that can be summarized by a short sentence. These data visualization slides are generously adapted from <a href="http://hci.stanford.edu/~cagatay/">Çağatay Demiralp</a>.</p>
<p>Source code for the examples we reviewed is available on the course Github page: <a href="https://github.com/jhofman/msd2019/tree/master/lectures/lecture_3">data manipulation</a>, <a href="https://github.com/jhofman/msd2019/tree/master/lectures/lecture_4">data visualization</a>.</p>
<p>There are <a href="https://pinboard.in/u:jhofman/t:r/t:tutorials/">lots of R resources</a> available on the web, but here are a few highlights:</p>
<ul>
<li><a href="http://tryr.codeschool.com">CodeSchool</a> and <a href="https://www.datacamp.com/courses/free-introduction-to-r">DataCamp</a> intro to R courses</li>
<li>More about <a href="http://www.r-tutor.com/r-introduction/basic-data-types">basic types</a> (numeric, character, logical, factor) in R</li>
<li>Vectors, lists, dataframes: a <a href="http://www.statmethods.net/input/datatypes.html">one page reference</a></li>
<li>Chapters <a href="http://r4ds.had.co.nz/introduction.html">1</a>, <a href="http://r4ds.had.co.nz/explore-intro.html">2</a>, and <a href="http://r4ds.had.co.nz/transform.html">5</a> of <a href="http://r4ds.had.co.nz">R for Data Science</a></li>
<li>DataCamp’s <a href="https://campus.datacamp.com/courses/dplyr-data-manipulation-r-tutorial">Data Manipulation in R</a> tutorial</li>
<li>The <a href="https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html">dplyr vignette</a></li>
<li>Sean Anderson’s <a href="http://seananderson.ca/2014/09/13/dplyr-intro.html">dplyr and pipes examples</a> (<a href="https://github.com/seananderson/dplyr-intro-2014">code</a> on github)</li>
<li>Rstudio’s <a href="http://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf">data wrangling cheatsheet</a></li>
<li>Hadley Wickham’s <a href="http://bit.ly/splitapplycombine">split/apply/combine</a> paper</li>
<li>The <a href="https://style.tidyverse.org">tidyverse style guide</a></li>
<li>Chapters <a href="http://r4ds.had.co.nz/data-visualisation.html">3</a>, <a href="http://r4ds.had.co.nz/exploratory-data-analysis.html">7</a>, and <a href="http://r4ds.had.co.nz/graphics-for-communication.html">28</a> in <a href="http://r4ds.had.co.nz/">R for Data Science</a></li>
<li>DataCamp’s <a href="https://campus.datacamp.com/courses/data-visualization-with-ggplot2-1/">Data Visualization with ggplot2</a> tutorial</li>
<li>Videos on <a href="http://varianceexplained.org/RData/lessons/lesson2/">Visualizing Data with ggplot2</a></li>
<li>Sean Anderson’s <a href="http://seananderson.ca/courses/12-ggplot2/ggplot2_slides_with_examples.pdf">ggplot2 slides</a> (<a href="http://github.com/seananderson/datawranglR">code</a>) for more examples</li>
<li>RStudio’s <a href="https://www.rstudio.com/resources/cheatsheets/">cheatsheets</a></li>
</ul>
Fri, 15 Feb 2019 10:10:00 +0000Lecture 3: Computational complexity
http://modelingsocialdata.org/lectures/2019/02/08/lecture-3-computational-complexity.html
http://modelingsocialdata.org/lectures/2019/02/08/lecture-3-computational-complexity.html<p>We had a guest lecture from <a href="http://sidsen.org/">Sid Sen</a> on computational complexity and algorithm analysis.</p>
<p><img src="http://modelingsocialdata.org/img/runtime_table.png" alt="Algorithm runtime in seconds, from Kleinberg & Tardos" /></p>
<p>Sid discussed various ways of analyzing how long algorithms take to run, focusing on worst-case analysis.
We discussed <a href="https://www.khanacademy.org/computing/computer-science/algorithms/asymptotic-notation/a/asymptotic-notation">asymptotic notation</a> (<a href="https://www.khanacademy.org/computing/computer-science/algorithms/asymptotic-notation/a/big-o-notation">big-O</a> for upper bounds, <a href="https://www.khanacademy.org/computing/computer-science/algorithms/asymptotic-notation/a/big-big-omega-notation">big-omega</a> for lower bounds, and <a href="https://www.khanacademy.org/computing/computer-science/algorithms/asymptotic-notation/a/big-big-theta-notation">big-theta</a> for tight bounds).
The table above, from <a href="https://www.pearsonhighered.com/program/Kleinberg-Algorithm-Design/PGM319216.html">Algorithm Design</a> by Kleinberg and Tardos, shows how long we should expect different algorithms to run on modern hardware.
The key takeaway is that knowing how to match the right algorithm to your dataset is important.
For instance, when you’re dealing with millions of observations, only linear (or maybe <a href="https://en.wikipedia.org/wiki/Time_complexity#Linearithmic_time">linearithmic</a>) time algorithms are practical.</p>
<p>A few other references:</p>
<ul>
<li>A <a href="https://rob-bell.net/2009/06/a-beginners-guide-to-big-o-notation/">beginner’s guide</a> to big-O notation</li>
<li>Another <a href="https://www.interviewcake.com/article/python/big-o-notation-time-and-space-complexity">introduction to big-O</a></li>
<li>The <a href="http://bigocheatsheet.com/">big-O cheatsheet</a></li>
</ul>
<p>We touched upon a few more advanced topics around the tradeoff between how long something takes to run and how much space it requires. Sid gave a brief overview of <a href="https://brilliant.org/wiki/skip-lists/">skip lists</a> and mentioned some more recent work by his advisor, Robert Tarjan, on <a href="https://arxiv.org/abs/1806.06726v2">zip trees</a> (video lecture <a href="https://www.youtube.com/watch?v=NxRXhBur6Xs">here</a>).</p>
<p>Sid finished his lecture by discussing how this applies to something as simple as taking the intersection of two lists, useful for <a href="https://en.wikipedia.org/wiki/Join_(SQL)">joining</a> different tables.
A naive approach of comparing all pairs of elements takes quadratic time.
It’s relatively easy to do much better by <a href="https://en.wikipedia.org/wiki/Sort-merge_join">sorting and merging</a> the two sets, reducing this to <code class="highlighter-rouge">n log(n)</code> time.
And if we’re willing to trade space for time, we can use a <a href="https://en.wikipedia.org/wiki/Hash_table">hash table</a> to get the job done in linear time, known as a <a href="https://en.wikipedia.org/wiki/Hash_join">hash join</a>.</p>
<p>We used the end of lecture to revisit the command line and finish up a few leftover topics. See <a href="/lectures/2019/02/01/lecture-2-counting.html">last week’s post</a> for links to code from class.</p>
<p>Next week we’ll discuss data manipulation in R. In preparation, make sure to <a href="/homework/2019/01/24/installing-tools.html">set up</a> R and the <a href="https://www.tidyverse.org">tidyverse</a> packages. If you’re new to R, in addition to the readings in the R4DS book, check out the <a href="http://tryr.codeschool.com">CodeSchool</a> and <a href="https://www.datacamp.com/courses/free-introduction-to-r">DataCamp</a> intro to R courses. Also have a look at the slides and code we’ll discuss in class next week, which are <a href="https://github.com/jhofman/msd2019/tree/master/lectures/lecture_3">up on github</a>.</p>
Fri, 08 Feb 2019 00:00:00 +0000Homework 1
http://modelingsocialdata.org/homework/2019/02/07/homework-1.html
http://modelingsocialdata.org/homework/2019/02/07/homework-1.html<p>The first homework assignment, <a href="https://github.com/jhofman/msd2019/tree/master/homework/homework_1">posted on Github</a>, is due on Thursday, February 21 by 11:59pm ET.</p>
<p>The first problem explores various counting techniques, the second involves some command line and R counting exercises, and the third looks at the impact of inventory size on customer satisfaction for the MovieLens data.
Details are in the README.md file for each problem.</p>
<p>Your code and a brief report with your results are to be submitted electronically in one zipped (or tarball-ed) file through the <a href="https://courseworks2.columbia.edu/courses/77738">CourseWorks</a> site.
All code should be contained in plain text files and should produce the exact results you provide in your writeup.
Code should be written in bash / R and should not have complex dependencies on non-standard libraries.
The report should simply present your answers to the questions in an organized format as either a plain text or pdf file.
All work should be your own and done individually.</p>
Thu, 07 Feb 2019 17:00:00 +0000Lecture 2: Introduction to Counting
http://modelingsocialdata.org/lectures/2019/02/01/lecture-2-counting.html
http://modelingsocialdata.org/lectures/2019/02/01/lecture-2-counting.html<p>Counting is surprisingly useful for understanding and summarizing social data. The key is figuring out what to count and how to count it efficiently.</p>
<center>
<script async="" class="speakerdeck-embed" data-id="0c088c1b50e44966a74c52e0b331995e" data-ratio="1.33333333333333" src="//speakerdeck.com/assets/embed.js"></script>
</center>
<p>While it’s a seemingly simple concept, counting can be quite challenging in practice, especially when dealing with large, multi-dimensional data.</p>
<p>First we discussed simple counting and uncertainty in the context of polling. We used <a href="https://github.com/jhofman/msd2019/blob/master/lectures/lecture_2/flip_coins.ipynb">simulations</a> to determine how large of a poll to conduct to stay within a given <a href="https://en.wikipedia.org/wiki/Margin_of_error#Calculations_assuming_random_sampling">margin of error</a>. In practice, there are many <a href="https://www.nytimes.com/interactive/2016/09/20/upshot/the-error-the-polling-world-rarely-talks-about.html">sources of uncertainty</a> in polling, which can often lead to much <a href="https://www.nytimes.com/2016/10/06/upshot/when-you-hear-the-margin-of-error-is-plus-or-minus-3-percent-think-7-instead.html">larger margins of error</a> than these results imply. See Chapters 5 and 6 of <a href="http://pluto.huji.ac.il/~msby/StatThink/index.html">Intro to Statistical Thinking (With R, Without Calculus)</a> for background on binomial random variables and sampling distributions.</p>
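The flavor of those simulations can be reproduced in a few lines of R (with illustrative numbers, not the class notebook): run many hypothetical polls of n people when true support is 50% and look at how much the estimates vary.

```r
set.seed(1)
n         <- 1000                                   # people per poll
estimates <- rbinom(1e4, size = n, prob = 0.5) / n  # 10,000 simulated polls

# 95% of simulated polls land within roughly +/- 3 points of the truth,
# matching the textbook 1.96 * sqrt(p * (1 - p) / n) margin of error
quantile(estimates, c(0.025, 0.975))
```

Quadrupling the sample size only halves the margin of error, which is why precise polls get expensive quickly.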
<p>Then we discussed the <a href="http://bit.ly/splitapplycombine">split/apply/combine</a> paradigm for counting and applied it to several examples from <a href="http://5harad.com/papers/long_tail.pdf">The Anatomy of the Long Tail</a>.
We also looked at alternative models for counting that trade off flexibility for scalability, such as <a href="http://en.wikipedia.org/wiki/Streaming_algorithm">streaming algorithms</a>.
Streaming allows us to compute statistics such as the mean or <a href="http://www.johndcook.com/blog/standard_deviation/">variance</a> without having to read all of the data into memory first.
We summarized these approaches and compared the types of statistics that can be computed under various conditions.</p>
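The streaming approach described in the linked post is Welford's algorithm; a sketch in R (an illustration, not the course code) makes one pass over the data and keeps only three running quantities:

```r
# one pass, O(1) memory: maintain the count, the running mean, and a
# running sum of squared deviations (Welford's algorithm)
streaming_stats <- function(xs) {
  n <- 0; avg <- 0; m2 <- 0
  for (x in xs) {
    n     <- n + 1
    delta <- x - avg
    avg   <- avg + delta / n         # update the running mean
    m2    <- m2 + delta * (x - avg)  # update sum of squared deviations
  }
  list(mean = avg, var = m2 / (n - 1))  # sample variance
}

streaming_stats(c(2, 4, 4, 4, 5, 5, 7, 9))  # agrees with mean() and var()
```

Because each observation is processed once and then discarded, the same code works whether the data fit in memory or arrive as an endless stream.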
<p>We concluded with an <a href="https://github.com/jhofman/msd2019/blob/master/lectures/lecture_2/intro_command_line.ipynb">introduction to the command line</a>, including some simple counting and exploration of the <a href="https://www.citibikenyc.com/system-data">CitiBike trip data</a>. Additional command line references and tutorials can be found in the <a href="/homework/2019/01/24/installing-tools.html">installing tools</a> post. All code and slides are on the <a href="https://github.com/jhofman/msd2019">course Github page</a>.</p>
Fri, 01 Feb 2019 00:00:00 +0000Lecture 1: Overview
http://modelingsocialdata.org/lectures/2019/01/25/lecture-1-overview.html
http://modelingsocialdata.org/lectures/2019/01/25/lecture-1-overview.html<p>We used our first lecture to look at case studies in four main areas: exploratory data analysis, classification, regression, and working with network data.</p>
<script async="" class="speakerdeck-embed" data-id="e74aaadf1779487c901c6bf3a2701902" data-ratio="1.33333333333333" src="//speakerdeck.com/assets/embed.js"></script>
<p>We discussed a few examples, including using aggregate search activity to <a href="http://www.pnas.org/content/107/41/17486.full.pdf">predict consumer behavior</a>, exploring browsing logs to understand how Internet usage <a href="http://www.aaai.org/ocs/index.php/ICWSM/ICWSM12/paper/viewFile/4660/4975">varies across demographic groups</a>, and analyzing the structure of information cascades to understand <a href="https://5harad.com/papers/twiral.pdf">how content spreads online</a>.</p>
<p>During this discussion, we touched on how easy it is to find <a href="http://www.tylervigen.com">spurious correlations</a>, <a href="http://hunch.net/?p=22">cheat at prediction</a>, and be <a href="https://en.wikipedia.org/wiki/Data_dredging">fooled by randomness</a>.</p>
Fri, 25 Jan 2019 00:00:00 +0000Installing tools
http://modelingsocialdata.org/homework/2019/01/24/installing-tools.html
http://modelingsocialdata.org/homework/2019/01/24/installing-tools.html<p>This class will involve a good deal of coding, for which you will need some basic tools. Please make sure to set up the following tools after the first day of class.</p>
<h3 id="an-interactive-bash-shell">An interactive <a href="http://www.gnu.org/software/bash/">bash</a> shell</h3>
<p>This will give you the ability to interact with your filesystem via the command line instead of a GUI such as Windows Explorer or Mac Finder. We will also use bash to automate acquiring and cleaning data sets.</p>
<p>If you use Windows, you can try the <a href="http://www.howtogeek.com/249966/how-to-install-and-use-the-linux-bash-shell-on-windows-10/">built-in bash/Ubuntu</a> shell on Windows 10, or you can <a href="https://cygwin.com/install.html">install Cygwin</a>, which includes bash and a terminal application by default. Mac OS X includes a bash shell by default, and a terminal application in <code class="highlighter-rouge">/Applications/Utilities</code>. Linux also includes a working shell and terminal.</p>
<p>Verify that your environment is properly configured by typing the following commands indicated after the <code class="highlighter-rouge">#</code> symbol. You should see something similar (although not necessarily identical) to the following:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code># echo $SHELL
/bin/bash
# grep --version
grep (BSD grep) 2.5.1-FreeBSD
# cut
usage: cut -b list [-n] [file ...]
cut -c list [file ...]
cut -f list [-s] [-d delim] [file ...]
</code></pre></div></div>
<p>If you’re new to the command line, see Codecademy’s <a href="https://www.codecademy.com/courses/learn-the-command-line/lessons/navigation/exercises/your-first-command?action=lesson_resume">interactive tutorial</a>, this <a href="https://learnpythonthehardway.org/book/appendixa.html">crash course</a>, and Software Carpentry’s <a href="http://swcarpentry.github.io/shell-novice/">guide</a>.
Lifehacker’s <a href="http://lifehacker.com/5633909/who-needs-a-mouse-learn-to-use-the-command-line-for-almost-anything">command line primer</a> is also decent.</p>
<p>O’Reilly’s <a href="http://shop.oreilly.com/product/9780596005955.do">Classic Shell Scripting</a> book is a more complete reference.</p>
<!-- https://github.com/veltman/clmystery -->
<h3 id="a-git-client">A <a href="http://git-scm.com">Git</a> client</h3>
<p>Git is a version control system that allows you to track modifications to files and code over time. It also facilitates collaborations so that multiple people can share and edit the same code base.</p>
<p>If you are on Windows you can install <a href="https://windows.github.com">Github for Windows</a> which provides both the command line tool for git and a graphical user interface. Alternatively, you can install git as an optional package under Cygwin. We recommend the Github application, as it will be easier to interface with Github using it. Likewise, modern versions of Mac OS X have a command line git client installed by default, but the <a href="https://mac.github.com">Github for Mac</a> tool is a recommended addition. Linux users can install git with the appropriate package manager (e.g., <code class="highlighter-rouge">yum install git</code> on RedHat or <code class="highlighter-rouge">apt-get install git</code> on Debian/Ubuntu), and there are a number of different <a href="http://unix.stackexchange.com/questions/144100/is-there-a-usable-gui-front-end-to-git-on-linux">git GUIs for Linux</a>.</p>
<p>Complete this relatively brief <a href="https://www.codeschool.com/courses/try-git">interactive tour of git</a>. See this <a href="http://rogerdudler.github.io/git-guide/">one-page guide</a> for explanations of the usual git workflow and the most common commands, or <a href="http://kbroman.org/github_tutorial/">here</a> for a more detailed guide. Github also has an <a href="https://www.youtube.com/watch?v=U8GBXvdmHT4">introductory video</a>, some <a href="https://services.github.com/resources/">training courses</a>, and a handy <a href="https://services.github.com/on-demand/resources/cheatsheets/">cheatsheet</a>.</p>
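<p>The usual workflow boils down to editing files, staging them, committing, and (when working with a remote) pushing. A minimal sketch, run in a throwaway repository with a made-up file name:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code># practice in a scratch repository
cd "$(mktemp -d)"
git init -q .
# create a file, stage it, and record a commit
touch notes.txt
git add notes.txt
git -c user.name=student -c user.email=student@example.com commit -q -m "Add notes"
# inspect the history
git log --oneline
</code></pre></div></div>
<p>In a cloned repository you would follow the commit with <code class="highlighter-rouge">git push</code> to publish your changes.</p>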
<h3 id="a-github-account">A <a href="http://github.com">Github</a> account</h3>
<p>Github is a platform that facilitates collaboration on projects that use git. You can use it to host projects, publish them to the web, and share them with other people. <a href="https://help.github.com/articles/signing-up-for-a-new-github-account/">Create a free account</a> if you don’t already have one.</p>
<p>Once you have an account, clone the <a href="https://github.com/jhofman/msd2019">course repository</a> using your local git client. This is most easily done on the command line as follows:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ git clone https://github.com/jhofman/msd2019.git
Cloning into 'msd2019'...
remote: Enumerating objects: 6, done.
remote: Counting objects: 100% (6/6), done.
remote: Compressing objects: 100% (4/4), done.
remote: Total 6 (delta 1), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (6/6), done.
</code></pre></div></div>
<p>When this is complete, verify that you have a local directory called <code class="highlighter-rouge">msd2019</code> containing a <code class="highlighter-rouge">README.md</code> file.</p>
<!-- https://happygitwithr.com -->
<h3 id="r-and-rstudio">R and RStudio</h3>
<p>R is a useful programming language for exploratory data analysis, data visualization, and statistical modeling. RStudio is a popular integrated development environment (IDE) for working in R.</p>
<p>First, download and install R from a <a href="https://cloud.r-project.org/">CRAN mirror</a>. Then download RStudio from <a href="https://www.rstudio.com/products/rstudio/download/">here</a>. Finally, install and load some important packages as follows:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>install.packages('tidyverse')
library(tidyverse)
</code></pre></div></div>
<p>If you’re new to R, see the <a href="http://tryr.codeschool.com/">Code School</a> and <a href="http://datacamp.com/courses/free-introduction-to-r">DataCamp</a> online tutorials.</p>
<p>We will discuss all of these tools in more detail in class.</p>
Thu, 24 Jan 2019 00:00:00 +0000