Counting is surprisingly useful for understanding and summarizing social data. The key is figuring out what to count and how to count it efficiently.

While it’s a seemingly simple concept, counting can be quite challenging in practice, especially when dealing with large, multi-dimensional data.

First we discussed simple counting and uncertainty in the context of polling. We used simulations to determine how large a poll needs to be to stay within a given margin of error. In practice, polling has many other sources of uncertainty, which can often lead to much larger margins of error than these results imply. See Chapters 5 and 6 of Intro to Statistical Thinking (With R, Without Calculus) for background on binomial random variables and sampling distributions.
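As a rough illustration of the simulation idea, here is a minimal sketch in Python (not the code used in class). It assumes an idealized poll in which every respondent independently supports a candidate with probability `true_support`. The function names, the 95% coverage level, and the candidate poll sizes are all hypothetical choices made for this example.

```python
import random


def simulate_poll(true_support, poll_size, rng):
    """Return the fraction of simulated respondents who say yes."""
    yes = sum(1 for _ in range(poll_size) if rng.random() < true_support)
    return yes / poll_size


def empirical_margin(true_support, poll_size, n_polls=1000, coverage=0.95, seed=0):
    """Estimate the error that `coverage` of simulated polls stay within.

    Assumption: sampling noise is the only source of error, which is
    exactly the idealization the text warns about.
    """
    rng = random.Random(seed)
    errors = sorted(
        abs(simulate_poll(true_support, poll_size, rng) - true_support)
        for _ in range(n_polls)
    )
    index = min(n_polls - 1, int(coverage * n_polls))
    return errors[index]


def smallest_poll_size(target_margin, candidate_sizes, true_support=0.5):
    """Return the first candidate size whose simulated margin meets the target."""
    for size in candidate_sizes:
        if empirical_margin(true_support, size) <= target_margin:
            return size
    return None  # none of the candidates was large enough


if __name__ == "__main__":
    for size in (100, 400, 1600):
        print(size, round(empirical_margin(0.5, size), 3))
```

Quadrupling the poll size should roughly halve the simulated margin, which matches the square-root behavior of the binomial standard error.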

Then we discussed the split/apply/combine paradigm for counting and applied it to several examples from The Anatomy of the Long Tail. We also looked at alternative models for counting that trade off flexibility for scalability, such as streaming algorithms. Streaming allows us to compute statistics such as the mean or variance without having to read all of the data into memory first. We summarized these approaches and compared the types of statistics that can be computed under various conditions.
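To make the streaming idea concrete, here is a small Python sketch that combines it with split/apply/combine: it reads `(key, value)` pairs one at a time and keeps a running mean and variance per group, so it never holds the full data set in memory. The update rule is Welford's online algorithm, which is one common choice and not necessarily the one shown in class. The class and function names and the toy data are hypothetical.

```python
from collections import defaultdict


class RunningStats:
    """Running count, mean, and sample variance (Welford's online update)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Sample variance; undefined with fewer than two observations.
        return self._m2 / (self.n - 1) if self.n > 1 else float("nan")


def stats_by_group(pairs):
    """Split by key, apply a streaming update, combine into one dict."""
    groups = defaultdict(RunningStats)
    for key, value in pairs:  # a single pass; `pairs` can be any iterator
        groups[key].update(value)
    return dict(groups)


if __name__ == "__main__":
    toy = [("a", 2), ("b", 10), ("a", 4), ("a", 6), ("b", 14)]
    for key, stats in sorted(stats_by_group(toy).items()):
        print(key, stats.n, stats.mean, stats.variance)
```

Memory use grows with the number of distinct groups rather than the number of records. The price is flexibility: statistics such as the median cannot be computed exactly with a fixed-size summary like this.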

We concluded with an introduction to the command line, including some simple counting and exploration of the CitiBike trip data. Additional command line references and tutorials can be found in the installing tools post. All code and slides are on the course GitHub page.