Counting is surprisingly useful for understanding and summarizing social data. The key is figuring out what to count and how to count it efficiently.

While it’s a seemingly simple concept, counting can be quite challenging in practice, especially when dealing with large, multi-dimensional data.

We discussed the split/apply/combine paradigm for counting and applied it to several examples from The Anatomy of the Long Tail. We also looked at alternative models for counting, such as streaming algorithms, that give up some flexibility in exchange for scalability. Streaming lets us compute statistics such as the mean or variance in a single pass, without first reading all of the data into memory. We summarized these approaches and compared the types of statistics that can be computed under various conditions.
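To make the streaming idea concrete, here is a minimal Python sketch of a one-pass mean and variance. It uses Welford's update rule, which is one common way to do this and not necessarily the approach shown in lecture. The `RunningStats` class, the `summarize` helper, and the sample durations are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Running count, mean, and variance using constant memory (Welford's method)."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the current mean

    def update(self, x: float) -> None:
        """Fold one new observation into the running statistics."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        # Uses the deviation from the old mean and from the updated mean.
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        """Sample variance; not defined for fewer than two observations."""
        return self.m2 / (self.n - 1) if self.n > 1 else float("nan")


def summarize(stream):
    """Consume any iterable of numbers one value at a time."""
    stats = RunningStats()
    for x in stream:
        stats.update(x)
    return stats


# Hypothetical trip durations in minutes. The generator stands in for
# a file that is too large to load into memory all at once.
durations = (d for d in [5.0, 12.0, 7.0, 30.0, 9.0])
stats = summarize(durations)
print(stats.n, stats.mean, stats.variance)
```

Only three numbers are kept no matter how long the stream is, which is what makes this scale. The cost is flexibility: a statistic like the exact median cannot be maintained this way with a fixed amount of memory.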

We concluded with more work on the command line, including some simple counting and exploration of the CitiBike trip data. Slides and code, including an “Introduction to the command line” notebook, are available on the course GitHub page.

Additional command-line references can be found in the installing tools post.