Lecture 4: Data Visualization | Modeling Social Data

We used this lecture to discuss data manipulation and data visualization in R, specifically focusing on dplyr and ggplot2 from the tidyverse.

The tidyverse relies on data being in a “tidy” format: one observation per row, one variable per column, and one value per cell. It provides tools for getting untidy data (of which there’s lots) into a tidy format. Once data are in this format, it provides tools for chaining together a sequence of commands, similar to Unix pipes, which makes it easy to translate the ideas and questions in your mind into working, readable code. This allows you to spend more time exploring and understanding your data and less time debugging code.
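As a rough illustration of the tidy-then-pipe workflow described above, here is a minimal sketch. The `untidy` table, its city names, and the ride counts are made up for this example and are not from the lecture materials; it assumes the dplyr and tidyr packages are installed.

```r
library(dplyr)
library(tidyr)

# Hypothetical untidy table: one row per city, one column per year,
# so the "year" variable is hidden in the column names.
untidy <- data.frame(
  city = c("A", "B"),
  `2019` = c(10, 20),
  `2020` = c(12, 18),
  check.names = FALSE
)

# Tidy it: one row per (city, year) observation, one column per variable.
tidy <- untidy %>%
  pivot_longer(cols = -city, names_to = "year", values_to = "rides") %>%
  mutate(year = as.integer(year))

# Chain verbs with the pipe to answer a question:
# which city has the most rides per year on average?
summary_by_city <- tidy %>%
  group_by(city) %>%
  summarize(mean_rides = mean(rides)) %>%
  arrange(desc(mean_rides))

print(summary_by_city)
```

Each step in the pipeline reads as one verb applied to the result of the previous step, which is what makes the code track the question being asked.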

We discussed visualization as a way to better understand data and as a way of communicating with readers. We briefly reviewed experiments by Cleveland and McGill showing that not all visual encodings are created equal, Mackinlay’s expressiveness / effectiveness tradeoff, and Wilkinson’s grammar of graphics. We spent a good amount of time discussing how every visualization should convey a point, preferably one that can be summarized in a short sentence. These data visualization slides are generously adapted from Çağatay Demiralp.
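To make the grammar-of-graphics idea concrete, here is a small sketch in ggplot2. It uses R’s built-in `mtcars` dataset rather than any data from the lecture, and the choice of variables and title is only an assumption for illustration. The plot maps variables to visual encodings (position and color), adds a geometric layer, and states its point in the title.

```r
library(ggplot2)

# Map data to aesthetics: position for the two quantitative variables
# and color for the categorical one.
p <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point() +  # geometric layer: one point per car
  labs(
    title = "Heavier cars get fewer miles per gallon",  # the point, in one short sentence
    x = "Weight (1000 lbs)",
    y = "Miles per gallon",
    color = "Cylinders"
  )

print(p)
```

Position along a common scale is used for the main comparison here because it ranked among the most accurately read encodings in Cleveland and McGill’s experiments.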

Source code for the examples we reviewed is available on the course GitHub page: data manipulation, data visualization.

There are lots of R resources available on the web, but here are a few highlights:

CodeSchool and DataCamp intro to R courses
More about basic types (numeric, character, logical, factor) in R
Vectors, lists, dataframes: a one page reference and more details
Chapters 1, 2, and 5 of R for Data Science
DataCamp’s Data Manipulation in R tutorial
The dplyr vignette
Sean Anderson’s dplyr and pipes examples (code on GitHub)
RStudio’s data wrangling cheatsheet
Hadley Wickham’s split/apply/combine paper
The tidyverse style guide
Chapters 3, 7, and 28 in R for Data Science
DataCamp’s Data Visualization with ggplot2 tutorial
Videos on Visualizing Data with ggplot2
Sean Anderson’s ggplot2 slides (code) for more examples
RStudio’s cheatsheets