Counting is surprisingly useful for understanding and summarizing social data. While it’s a seemingly simple concept, counting can be quite challenging in practice, especially when dealing with large, multi-dimensional data.
We discussed the split/apply/combine paradigm for counting and applied it to several examples from The Anatomy of the Long Tail. We also looked at alternative models for counting that trade off flexibility for scalability, such as streaming algorithms. Streaming allows us to compute statistics such as the mean or variance without having to read all of the data into memory first. We summarized these approaches and compared the types of statistics that can be computed under various conditions.
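As a rough illustration of how the two ideas fit together, the sketch below splits a stream of records by key and keeps a running count, mean, and variance for each group, so no group's values are ever held in memory at once. It uses Welford's online algorithm, which is one common way to do this and not necessarily the method covered in lecture. The names `RunningStats` and `grouped_stats` and the `(key, value)` record layout are assumptions made for the example.

```python
from collections import defaultdict


class RunningStats:
    """One-pass count, mean, and variance (Welford's online algorithm)."""

    def __init__(self):
        self.n = 0        # number of observations seen so far
        self.mean = 0.0   # running mean
        self.m2 = 0.0     # running sum of squared deviations from the mean

    def update(self, x):
        # Fold a single observation into the running statistics.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Sample variance; not defined for fewer than two observations.
        return self.m2 / (self.n - 1) if self.n > 1 else float("nan")


def grouped_stats(records):
    """Split records by key and apply a streaming summary to each group.

    `records` is any iterable of (key, value) pairs (a hypothetical layout),
    consumed one pair at a time.
    """
    stats = defaultdict(RunningStats)
    for key, value in records:
        stats[key].update(value)
    return stats


if __name__ == "__main__":
    # Made-up toy records, only to show the call pattern.
    toy = [("a", 1.0), ("a", 2.0), ("a", 3.0), ("b", 10.0)]
    for key, s in sorted(grouped_stats(toy).items()):
        print(key, s.n, s.mean, s.variance)
```

Memory use grows with the number of distinct groups rather than the number of records. Statistics that depend on the full ordering of the values, such as the exact median, cannot be maintained with a fixed-size summary like this one.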
We concluded with more work on the command line, including some simple counting and exploration of the CitiBike trip data. Scripts are available on the course GitHub page.
Some additional references for working in the shell:
- Lifehacker’s command line primer
- Software Carpentry’s slides/videos
- A concise introduction to the command line
- A wikibook on ad hoc data analysis
- An awk primer
- An extensive command line book