In this lecture we discussed combining and reshaping data in R as well as counting at scale with MapReduce.
First we extended last week’s discussion of data manipulation in R by looking at the various joins available in `dplyr` (inner, left, full, and anti) for combining different tables.
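As a quick illustration, here is a minimal sketch of the four joins on two made-up tables (the `users` and `purchases` data below are hypothetical):

```r
library(dplyr)

# hypothetical example tables
users <- tibble(user_id = c(1, 2, 3), name = c("ann", "bob", "carol"))
purchases <- tibble(user_id = c(1, 1, 3, 4), amount = c(9.99, 4.50, 12.00, 7.25))

inner_join(users, purchases, by = "user_id")  # only users with a matching purchase
left_join(users, purchases, by = "user_id")   # all users, NA amount if no purchase
full_join(users, purchases, by = "user_id")   # all rows from both tables
anti_join(users, purchases, by = "user_id")   # users with no purchases at all
```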
Then we used the `tidyr` package to reshape data that comes in inconvenient formats (e.g., from long to wide with `spread`, or vice versa with `gather`).
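For example, here is a small sketch of round-tripping a made-up long-format table (the data are invented for illustration):

```r
library(tidyr)

# hypothetical long-format data: one row per (country, year) pair
long_df <- data.frame(
  country = c("US", "US", "FR", "FR"),
  year = c(2000, 2010, 2000, 2010),
  population = c(282, 309, 61, 65)
)

# long to wide: one column per year
wide_df <- spread(long_df, key = year, value = population)

# and back again: collect the year columns into key/value pairs
# (note that year comes back as a character column)
gather(wide_df, key = "year", value = "population", -country)
```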
See this Jupyter notebook for more details.
Additional readings include Chapter 12 of *R for Data Science* for `tidyr` and Chapter 13 for joins. There are also useful vignettes for two-table verbs in `dplyr` and tidy data with `tidyr`.
In the second half of class we talked about counting at scale with MapReduce.
At its core, MapReduce is a distributed system for solving the split/apply/combine problem at scale, essentially functioning as a distributed group-by operation.
The programmer implements a `map` function, which defines how records should be split into groups, and a `reduce` function, which defines what to compute within each group.
The system takes care of the rest of the complex engineering details, from distributed storage to fault tolerance, in a manner that makes the parallelism virtually transparent to the programmer.
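To make this division of labor concrete, here is a toy, single-machine sketch of wordcount in R. It only mimics the map, shuffle, and reduce phases in memory; it is not how you would actually run a Hadoop job:

```r
# map: emit a (key, value) pair for each word in a line
map_fn <- function(line) {
  words <- unlist(strsplit(tolower(line), "[^a-z]+"))
  words <- words[words != ""]
  lapply(words, function(w) list(key = w, value = 1))
}

# reduce: sum the values emitted for a single key
reduce_fn <- function(key, values) {
  list(key = key, value = sum(unlist(values)))
}

lines <- c("the quick brown fox", "the lazy dog", "the fox")

# "map" phase: apply map_fn to every input record
pairs <- unlist(lapply(lines, map_fn), recursive = FALSE)

# "shuffle" phase: group values by key (the system does this for you)
grouped <- split(sapply(pairs, `[[`, "value"), sapply(pairs, `[[`, "key"))

# "reduce" phase: apply reduce_fn within each group
counts <- mapply(reduce_fn, names(grouped), grouped, SIMPLIFY = FALSE)
```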
Hadoop is a popular open-source implementation of the MapReduce paradigm. We discussed how Hadoop Streaming can be used to scale existing code, and briefly looked at higher-level languages that abstract away some low-level MapReduce details from the programmer. For instance, Pig is a high-level language that converts sequences of common data analysis operations (e.g., filter, sort, join, group by) to chains of MapReduce jobs and executes them either locally or across a Hadoop cluster. Hive is similar, but follows the SQL paradigm more closely.
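To give a feel for Hadoop Streaming, below is a sketch of the wordcount mapper and reducer as R scripts that read from stdin and write tab-separated key/value pairs to stdout. The file names and the `hadoop` invocation in the final comment are illustrative and will vary with your installation; the scripts must also be made executable:

```r
#!/usr/bin/env Rscript
# mapper.R (hypothetical name): emit a "word<TAB>1" line for each word on stdin
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1)) > 0) {
  words <- unlist(strsplit(tolower(line), "[^a-z]+"))
  for (w in words[words != ""]) {
    cat(w, "\t1\n", sep = "")
  }
}

#!/usr/bin/env Rscript
# reducer.R (hypothetical name): streaming sorts mapper output by key,
# so identical words arrive on consecutive lines; sum each run of counts
con <- file("stdin", open = "r")
current_word <- ""
current_count <- 0
while (length(line <- readLines(con, n = 1)) > 0) {
  parts <- strsplit(line, "\t", fixed = TRUE)[[1]]
  word <- parts[1]
  count <- as.integer(parts[2])
  if (word == current_word) {
    current_count <- current_count + count
  } else {
    if (current_word != "") {
      cat(current_word, "\t", current_count, "\n", sep = "")
    }
    current_word <- word
    current_count <- count
  }
}
# flush the final key
if (current_word != "") {
  cat(current_word, "\t", current_count, "\n", sep = "")
}

# run with something like (jar path and input/output names are illustrative):
# hadoop jar /path/to/hadoop-streaming.jar \
#   -input input.txt -output counts \
#   -mapper mapper.R -reducer reducer.R \
#   -file mapper.R -file reducer.R
```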
See this CACM article and Chapter 2 of *Mining of Massive Datasets* for more on MapReduce. Michael Noll also has a nice tutorial. And code for the wordcount example we covered in class is on the course GitHub page.