This was our second lecture on causality and experimentation, in which we discussed statistical inference and reproducibility for randomized experiments as well as the design and analysis of natural experiments.

The previous lecture provided a high-level overview of experimentation, focusing on randomized experiments as the gold standard for causal inference. In the first part of this lecture we discussed how to reliably design and analyze randomized experiments. We began with a review of statistical inference, following Yakir’s approach of using simulations to look at sampling distributions, point estimates, confidence intervals, hypothesis testing, and power calculations. The basic idea is that you can circumvent a good deal of theory and simulate things directly by repeatedly sampling data to arrive at the usual results for inference and testing. This has the downside that it’s computationally expensive, but the upside that it presents statistics in a clear, concrete, and practical manner. See here for the code, and the visually appealing Seeing Theory site for more.
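The simulation idea above can be sketched in a few lines. This is a minimal illustration rather than the course code: the function names, the normal population, and the parameter values are all assumptions chosen for the example. It repeatedly draws samples to approximate the sampling distribution of the mean, then checks how often a nominal 95% confidence interval covers the true mean.

```python
import numpy as np


def simulate_sample_means(pop_mean, pop_sd, n, num_sims, rng):
    """Draw num_sims samples of size n and return the mean of each sample."""
    samples = rng.normal(pop_mean, pop_sd, size=(num_sims, n))
    return samples.mean(axis=1)


def ci_coverage(pop_mean, pop_sd, n, num_sims, rng, z=1.96):
    """Fraction of simulated samples whose normal-approximation CI covers pop_mean."""
    samples = rng.normal(pop_mean, pop_sd, size=(num_sims, n))
    means = samples.mean(axis=1)
    # Standard error estimated from each sample (ddof=1 gives the sample sd).
    std_errs = samples.std(axis=1, ddof=1) / np.sqrt(n)
    covered = (means - z * std_errs <= pop_mean) & (pop_mean <= means + z * std_errs)
    return covered.mean()


# Hypothetical settings: a standard normal population, samples of size 100.
rng = np.random.default_rng(0)
means = simulate_sample_means(0.0, 1.0, n=100, num_sims=10_000, rng=rng)
coverage = ci_coverage(0.0, 1.0, n=100, num_sims=10_000, rng=rng)

print(f"sd of simulated sample means: {means.std():.3f} (theory: {1 / np.sqrt(100):.3f})")
print(f"coverage of nominal 95% intervals: {coverage:.3f}")
```

The spread of the simulated means should land close to the theoretical standard error, and the coverage close to 95%, without invoking any formulas beyond the interval itself.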

Then we discussed several ways in which randomized experiments can go wrong, including small sample sizes, multiple hypothesis testing, post-hoc data analysis, and p-hacking.
The combination of these effects has led to a replication crisis in the social sciences, wherein researchers have found that a number of published experimental findings have failed to replicate in followup studies.
Following Felix Schönbrodt’s excellent blog post, we discussed how underpowered studies lead to false discoveries.
While these issues are complex, there are a few best practices (e.g., running pilot studies followed by pre-registration of high-powered, large-scale experiments) that can help mitigate these concerns.
Registered reports are a particularly attractive solution, wherein researchers write up and submit an experimental study for peer review *before* the study is conducted.
Reviewers make an acceptance decision at this point based on the merit of the study, and, if accepted, it is published regardless of the results.
Finally, we briefly touched on ethical and practical concerns of running randomized experiments, looking at Facebook’s study of emotional contagion and Kohavi et al.’s practical tips for running A/B tests.
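To make the link between low power and false discoveries concrete, here is a rough simulation sketch. It is not taken from Schönbrodt’s post or app: the function name, the effect size of 0.3, the assumption that 30% of tested hypotheses are true, and the use of a two-sided z-test with known unit variance are all illustrative choices.

```python
import numpy as np


def power_and_fdr(n_per_group, effect_size, prior_true, num_studies, rng, z_crit=1.96):
    """Simulate many two-group studies; return (power, false discovery rate)."""
    # Only a fraction of the hypotheses being tested are actually true.
    has_effect = rng.random(num_studies) < prior_true
    true_diff = np.where(has_effect, effect_size, 0.0)

    control = rng.normal(0.0, 1.0, size=(num_studies, n_per_group))
    treated = rng.normal(true_diff[:, None], 1.0, size=(num_studies, n_per_group))

    # z-test for a difference in means, assuming a known variance of 1 per group.
    z = (treated.mean(axis=1) - control.mean(axis=1)) / np.sqrt(2.0 / n_per_group)
    significant = np.abs(z) > z_crit

    power = significant[has_effect].mean()
    fdr = (significant & ~has_effect).sum() / significant.sum()
    return power, fdr


rng = np.random.default_rng(0)
small_power, small_fdr = power_and_fdr(20, 0.3, 0.3, 10_000, rng)
large_power, large_fdr = power_and_fdr(200, 0.3, 0.3, 10_000, rng)

print(f"n=20 per group:  power={small_power:.2f}, share of false discoveries={small_fdr:.2f}")
print(f"n=200 per group: power={large_power:.2f}, share of false discoveries={large_fdr:.2f}")
```

Under these assumptions the significance threshold is the same in both settings, yet a much larger share of the "significant" results from the small studies are false positives, because true effects are rarely detected.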

In the second part of lecture we moved on to natural experiments. We followed Dunning’s treatment of instrumental variables (IV) by looking at randomized experiments with non-compliance, where there’s a difference between assignment to treatment (e.g., whether you’re told to take a drug) and receipt of treatment (e.g., whether you actually take it). The basic idea is that we can estimate two separate quantities: the effect of being assigned a treatment and the rate of actually complying with that assignment. Dividing the former by the latter provides an estimate of the causal effect of actually receiving the treatment among those who comply. Furthermore, we can extend this analysis to situations in which nature provides the randomization instead of a researcher flipping a coin, in which case the source of randomness is referred to as an “instrument” that systematically shifts the probability of being treated. Classic examples include lotteries or weather events. We briefly looked at an example of the latter in a recent paper that uses random variations in weather to study peer effects of exercise in social networks. We concluded with a discussion about the benefits and limitations of traditional approaches to finding and arguing for valid instruments, and looked at an example of data-driven approaches to finding instruments.
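The ratio described above (the effect of assignment divided by the compliance rate) can be sketched on simulated data. Everything in this example is an assumption made for illustration: the function name, the 60% compliance rate, the true effect of 2.0, and the setup in which only people assigned to treatment can receive it and compliers have better baseline outcomes.

```python
import numpy as np


def simulate_noncompliance(n, true_effect, rng):
    """Simulate a randomized trial with one-sided non-compliance."""
    assigned = rng.random(n) < 0.5   # coin-flip assignment to treatment
    complier = rng.random(n) < 0.6   # would take the treatment if assigned
    received = assigned & complier   # only assigned compliers are treated
    # Compliers have better baseline outcomes, which confounds naive comparisons.
    outcome = 1.0 * complier + true_effect * received + rng.normal(0.0, 1.0, n)
    return assigned, received, outcome


rng = np.random.default_rng(0)
assigned, received, outcome = simulate_noncompliance(100_000, 2.0, rng)

# Effect of being assigned to treatment (intent-to-treat).
itt = outcome[assigned].mean() - outcome[~assigned].mean()
# How much assignment shifts the probability of actually being treated.
compliance = received[assigned].mean() - received[~assigned].mean()
# Ratio of the two: estimated effect of receiving treatment among compliers.
iv_estimate = itt / compliance
# Naive comparison of treated vs. untreated, ignoring who chose to comply.
naive = outcome[received].mean() - outcome[~received].mean()

print(f"intent-to-treat effect: {itt:.2f}")
print(f"compliance rate:        {compliance:.2f}")
print(f"IV estimate:            {iv_estimate:.2f}")
print(f"naive estimate:         {naive:.2f}")
```

In this setup the ratio should recover something close to the simulated effect of 2.0, while the naive treated-versus-untreated comparison is biased upward because it mixes the treatment effect with the baseline difference between compliers and non-compliers.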

References:

- Chapters 12 and 13 of an Introduction to Statistical Thinking (With R, Without Calculus)
- Why Most Published Research Findings Are False
- Instrumental Variables by Thad Dunning (followup here)
- See Chapter 5 of Natural Experiments in the Social Sciences by Dunning for more detail
- Seeing Theory, a visual, simulation-based tour of statistics
- Felix Schönbrodt’s blog post and shiny app on misconceptions about p-values and false discoveries
- Calculating the power of a test
- Estimating the reproducibility of psychological science by Nosek et al.
- Power failure: why small sample size undermines the reliability of neuroscience by Button et al.
- False-Positive Psychology by Simmons, Nelson & Simonsohn
- Science magazine’s announcement of registered reports
- Pre-registration portals from the Open Science Framework, Center for Open Science, and AsPredicted.org
- Experimental evidence of massive-scale emotional contagion through social networks by Kramer, Guillory & Hancock
- The garden of forking paths by Gelman & Loken
- Instrumental variables for clinical trials discussed in the New York Times
- Exercise contagion in a global social network by Aral & Nicolaides
- Estimating the causal impact of recommendation systems from observational data by Sharma, Hofman & Watts