We used this lecture to discuss data manipulation and data visualization in R, specifically focusing on dplyr and ggplot2 from the tidyverse.

The tidyverse relies on data being in a “tidy” format: one observation per row, one variable per column, and one value per cell. It provides tools for getting untidy data (of which there’s lots) into a tidy format. Once data are in this format, it provides tools for chaining together a sequence of commands, similar to Unix pipes, which makes it very easy to translate the ideas and questions in your mind into working, readable code. This allows you to spend more time exploring and understanding your data and less time debugging code.
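As a rough illustration of both ideas, here is a minimal sketch that is not taken from the lecture code. The `untidy` table and its values are made up for this example, and it assumes a version of tidyr recent enough to provide `pivot_longer()` (older code would use `gather()` instead). It first reshapes a table with one column per year into tidy form, then pipes the result through a short chain of dplyr verbs.

```r
library(dplyr)
library(tidyr)

# Hypothetical untidy data: one row per city, one column per year.
# The year is a variable, but here it is hidden in the column names.
untidy <- tibble(
  city   = c("A", "B"),
  `2015` = c(10, 20),
  `2016` = c(12, 18)
)

# Tidy it: one row per (city, year) observation, one column per variable.
tidy <- untidy %>%
  pivot_longer(cols = -city, names_to = "year", values_to = "count")

# Chain dplyr verbs with the pipe: "for each city, compute the average
# count, then sort cities from highest to lowest average."
city_means <- tidy %>%
  group_by(city) %>%
  summarize(mean_count = mean(count)) %>%
  arrange(desc(mean_count))

print(city_means)
```

Each step in the chain reads like a clause in the question being asked, which is the main appeal of the pipe.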

We discussed visualization as a way to better understand data and as a way of communicating with readers. We briefly reviewed experiments by Cleveland and McGill showing that not all visual encodings are created equal, Mackinlay’s expressiveness / effectiveness tradeoff, and Wilkinson’s grammar of graphics. We spent a good amount of time discussing how every visualization should convey a point, preferably one that can be summarized in a short sentence. These data visualization slides are generously adapted from Çağatay Demiralp.
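To make the grammar-of-graphics idea concrete, here is a small ggplot2 sketch, again not from the lecture materials. It uses R's built-in `mtcars` dataset as a stand-in, and the choice of variables and title is ours. The plot is built by mapping variables to position on the x and y axes (the kind of encoding Cleveland and McGill found readers judge most accurately), adding layers, and stating the point of the figure in its title.

```r
library(ggplot2)

# Build the plot from grammar-of-graphics pieces:
# data + aesthetic mappings + geometric layers + labels.
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +                              # one point per car
  geom_smooth(method = "lm", se = FALSE) +    # linear trend line, no error band
  labs(
    title = "Heavier cars get fewer miles per gallon",
    x = "Weight (1000 lbs)",
    y = "Miles per gallon"
  )

print(p)
```

Swapping `geom_point()` for another geom, or mapping a third variable to color, changes the plot without changing the rest of the specification.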

Source code for the examples we reviewed is available on the course GitHub page: data manipulation, data visualization.

There are lots of R resources available on the web, but here are a few highlights: