We used this lecture to discuss data manipulation and data visualization in R, specifically focusing on dplyr and ggplot2 from the tidyverse.
The tidyverse relies on data being in a “tidy” format of one observation per row, one variable per column, and one value per cell. It provides tools for getting untidy data (of which there’s a lot) into a tidy format. Once data are in this format, it provides tools for chaining together a string of commands, similar to Unix pipes, which makes it very easy to translate the ideas and questions in your mind into working and readable code. This allows you to spend more time exploring and understanding your data and less time debugging code.
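As a rough illustration of both ideas, the sketch below reshapes a small untidy table into tidy form and then chains dplyr verbs with the pipe. The data frame and its column names (`city`, `year`, `trips`) are made up for this example and are not taken from the lecture's source code.

```r
library(dplyr)
library(tidyr)

# Hypothetical untidy data: one row per city, one column per year,
# so the "year" variable is hidden in the column names.
untidy <- tibble(
  city   = c("A", "B"),
  `2019` = c(10, 20),
  `2020` = c(12, 18)
)

# Reshape to tidy form: one row per (city, year) observation.
tidy <- untidy %>%
  pivot_longer(cols = -city, names_to = "year", values_to = "trips") %>%
  mutate(year = as.integer(year))

# Chain verbs to answer "what is the average number of trips per city?"
summary_by_city <- tidy %>%
  group_by(city) %>%
  summarize(mean_trips = mean(trips)) %>%
  arrange(desc(mean_trips))

print(summary_by_city)
```

Each step reads left to right, in the same order you would describe the question out loud: take the data, group it by city, average the trips, then sort.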
We discussed visualization as a way to better understand data and as a way of communicating with readers. We briefly reviewed experiments by Cleveland and McGill showing that not all visual encodings are created equal, Mackinlay’s expressiveness / effectiveness tradeoff, and Wilkinson’s grammar of graphics. We spent a good amount of time discussing how every visualization should convey a point, preferably one that can be summarized in a short sentence. These data visualization slides are generously adapted from Çağatay Demiralp.
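To make the grammar-of-graphics idea concrete, here is a minimal ggplot2 sketch in which a plot is assembled from data, aesthetic mappings, and layers, with the point of the figure stated as a one-sentence title. It uses the `mpg` dataset that ships with ggplot2; the choice of dataset and variables is an assumption for illustration, not necessarily what was shown in lecture.

```r
library(ggplot2)

# Data + aesthetic mappings: both variables are encoded as position on a
# common scale, one of the more accurately read encodings.
p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  # Layer 1: one semi-transparent point per car
  geom_point(alpha = 0.5) +
  # Layer 2: a linear trend line to make the overall pattern explicit
  geom_smooth(method = "lm", se = FALSE) +
  # The title states the point of the plot in a short sentence
  labs(
    title = "Cars with larger engines get fewer highway miles per gallon",
    x = "Engine displacement (liters)",
    y = "Highway fuel economy (mpg)"
  )

# In an interactive session, printing the object draws the plot:
# print(p)
```

Swapping the geoms or the aesthetic mappings changes the chart type without touching the data, which is the main practical payoff of thinking in terms of a grammar.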
Source code for the examples we reviewed is available on the course GitHub page: data manipulation, data visualization.
There are lots of R resources available on the web, but here are a few highlights:
- CodeSchool and DataCamp intro to R courses
- More about basic types (numeric, character, logical, factor) in R
- Vectors, lists, dataframes: a one-page reference and more details
- Chapters 1, 2, and 5 of R for Data Science
- DataCamp’s Data Manipulation in R tutorial
- The dplyr vignette
- Sean Anderson’s dplyr and pipes examples (code on GitHub)
- RStudio’s data wrangling cheatsheet
- Hadley Wickham’s split/apply/combine paper
- The tidyverse style guide
- Chapters 3, 7, and 28 in R for Data Science
- DataCamp’s Data Visualization with ggplot2 tutorial
- Videos on Visualizing Data with ggplot2
- Sean Anderson’s ggplot2 slides (code) for more examples
- RStudio’s cheatsheets