In this lecture we discussed combining and reshaping data in R as well as counting at scale with MapReduce.

First we extended last week’s discussion of data manipulation in R by looking at the various joins available in dplyr (inner, left, full, and anti) for combining different tables. Then we used the tidyr package to reshape data that comes in inconvenient formats (e.g., from long to wide with spread, or from wide back to long with gather). A small sketch of both appears below.
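To make the joins and the reshaping concrete, here is a quick sketch with made-up data (the `scores` and `majors` tables are invented for illustration):

```r
library(dplyr)
library(tidyr)

# two small made-up tables sharing a "student" key
scores <- data.frame(student = c("alice", "alice", "bob", "bob"),
                     exam = c("midterm", "final", "midterm", "final"),
                     score = c(90, 85, 80, 95),
                     stringsAsFactors = FALSE)
majors <- data.frame(student = c("alice", "carol"),
                     major = c("history", "physics"),
                     stringsAsFactors = FALSE)

inner_join(scores, majors, by = "student")  # only students in both tables
left_join(scores, majors, by = "student")   # all of scores, NA where major is unknown
full_join(scores, majors, by = "student")   # all students from either table
anti_join(scores, majors, by = "student")   # rows of scores with no matching major

# reshape from long to wide (one column per exam), then back to long
wide <- spread(scores, exam, score)
gather(wide, exam, score, -student)
```

The anti join is especially handy as a sanity check for keys that fail to match across tables.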

See this Jupyter notebook for more details. Additional readings include Chapter 12 of R for Data Science for tidyr and Chapter 13 for joins. There are also useful vignettes for two-table verbs in dplyr and tidy data with tidyr.

In the second half of class we talked about counting at scale with MapReduce. At its core, MapReduce is a distributed system for solving the split/apply/combine problem at scale, essentially functioning as a distributed group-by operation. The programmer implements a map function, which defines how records should be split into groups, and a reduce function, which defines what to compute within each group. The system takes care of the rest of the complex engineering details, from distributed storage to fault tolerance, in a manner that makes the parallelism virtually transparent to the programmer.
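To see the group-by analogy concretely, here is word counting phrased as a map and a reduce in plain R. This is a local simulation we sketched for illustration, not the framework itself: map turns each record into (word, 1) pairs, the framework groups pairs by word, and reduce sums within each group.

```r
# map: turn one input record (a line of text) into (key, value) = (word, 1) pairs
map <- function(line) {
  words <- strsplit(line, "[[:space:]]+")[[1]]
  data.frame(key = words, value = 1, stringsAsFactors = FALSE)
}

# reduce: compute a summary (here, a sum) over all values sharing a key
reduce <- function(key, values) {
  data.frame(key = key, count = sum(values), stringsAsFactors = FALSE)
}

# what the framework does in between: apply map to every record,
# group the resulting pairs by key, and hand each group to reduce
lines <- c("the quick brown fox", "the lazy dog", "the fox jumps")
pairs <- do.call(rbind, lapply(lines, map))
groups <- split(pairs$value, pairs$key)
do.call(rbind, Map(reduce, names(groups), groups))
```

In a real MapReduce job the same two functions run in parallel across many machines; the grouping step in the middle (the "shuffle") is exactly what the system provides for free.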

Hadoop is a popular open source implementation of the MapReduce paradigm. We discussed how Hadoop Streaming can be used to scale existing code, and briefly looked at higher-level languages that abstract away some low-level MapReduce details from the programmer. For instance, Pig is a high-level language that converts sequences of common data analysis operations (e.g., filter, sort, join, group by, etc.) to chains of MapReduce jobs and executes these either locally or across a Hadoop cluster. Hive is similar, but follows the SQL paradigm more closely.
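With Hadoop Streaming, the mapper and reducer are just programs that read from stdin and write tab-separated key/value pairs to stdout, so they can be written in any language, R included. Below is a sketch of a streaming wordcount; the file names mapper.R and reducer.R are our own, and for the exact code covered in class see the course GitHub page mentioned below.

```r
#!/usr/bin/env Rscript
# mapper.R: emit a tab-separated (word, 1) pair for every word on stdin
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1)) > 0) {
  for (word in strsplit(line, "[[:space:]]+")[[1]]) {
    if (nchar(word) > 0) cat(word, "\t1\n", sep = "")
  }
}
close(con)
```

```r
#!/usr/bin/env Rscript
# reducer.R: streaming input arrives sorted by key, so all pairs for a
# given word are adjacent; keep a running total and emit it on key change
con <- file("stdin", open = "r")
current <- NULL
total <- 0
while (length(line <- readLines(con, n = 1)) > 0) {
  fields <- strsplit(line, "\t")[[1]]
  word <- fields[1]
  count <- as.integer(fields[2])
  if (!is.null(current) && word != current) {
    cat(current, "\t", total, "\n", sep = "")
    total <- 0
  }
  current <- word
  total <- total + count
}
if (!is.null(current)) cat(current, "\t", total, "\n", sep = "")
close(con)
```

Because a streaming job is just a pipeline over sorted text, you can test the pair locally with `cat input.txt | Rscript mapper.R | sort | Rscript reducer.R` before submitting anything to a cluster.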

See this CACM article and Chapter 2 of Mining Massive Data Sets for more on MapReduce. Michael Noll also has a nice tutorial. And code for the wordcount example we covered in class is on the course GitHub page.