We spent this lecture discussing network data, beginning with a history of graph theory dating back to Euler. An overly brief summary of our whirlwind tour through several centuries of related math and science goes like this.

People have studied theoretical problems on graphs, and the properties of graphs, for a long time, but only in the last few decades have we had access to real network data, such as online social networks or the topology of the Internet. When these data became available, it quickly became clear that real networks looked quite different from well-studied theoretical models (e.g., Erdős–Rényi random graphs). For example, many real networks have highly skewed degree distributions, reflecting the fact that most people in a social network have few friends while only a few people have many friends. At the same time, social networks typically have short path lengths, in the sense that one needs to traverse only a handful of links to connect a randomly selected pair of people in the network.
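
To make the contrast with Erdős–Rényi random graphs concrete, here is a minimal Python sketch (not code from the lecture) that samples a G(n, p) graph and tallies node degrees. The function name, the parameters `n=500` and `p=0.01`, and the fixed seed are illustrative assumptions. In this model every pair of nodes is linked independently with the same probability, so degrees cluster around the mean rather than forming the long tail seen in real networks.

```python
import random
from collections import Counter


def erdos_renyi_degrees(n, p, seed=0):
    """Sample a G(n, p) random graph and return the degree of each node.

    Hypothetical helper for illustration: each of the n*(n-1)/2 possible
    edges is included independently with probability p.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    degree = [0] * n
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                degree[u] += 1
                degree[v] += 1
    return degree


# Illustrative parameters (assumptions, not values from the lecture).
degrees = erdos_renyi_degrees(n=500, p=0.01)
mean_degree = sum(degrees) / len(degrees)
histogram = Counter(degrees)  # degree -> number of nodes with that degree

print("mean degree:", mean_degree)
print("max degree:", max(degrees))
for k in sorted(histogram):
    print(k, histogram[k])
```

With these settings the expected degree is p(n - 1) = 4.99, and the largest degree in a sample stays within a small multiple of that. A social network with the same average degree would typically contain a few nodes with far more connections.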

To better understand properties of networks and how to compute them, we looked at a few example networks in R using the igraph package. See this Rmarkdown notebook for different representations of networks and details for computing degree distributions and path length distributions. Additional details and references are in the outline and code on the GitHub page. See also Easley and Kleinberg’s freely available Networks, Crowds, and Markets book, specifically chapters 2, 18, and 20.
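
As a rough companion to the notebook, the two computations mentioned above can be sketched in plain Python. This is not the igraph code from class: the helper names and the five-node toy edge list are made up for illustration, and path lengths are found with an ordinary breadth-first search on an unweighted, undirected graph.

```python
from collections import Counter, deque


def build_adjacency(edges):
    """Turn an undirected edge list into an adjacency-set dictionary."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def degree_distribution(adjacency):
    """Map each degree k to the number of nodes that have degree k."""
    return Counter(len(neighbors) for neighbors in adjacency.values())


def shortest_path_lengths(adjacency, source):
    """Breadth-first search: hop counts from source to every reachable node."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return distance


def path_length_distribution(adjacency):
    """Map each path length d to the number of unordered node pairs at distance d."""
    counts = Counter()
    for source in adjacency:
        for target, d in shortest_path_lengths(adjacency, source).items():
            if source < target:  # count each pair once; assumes comparable node labels
                counts[d] += 1
    return counts


# Toy graph (hypothetical): a small star around "a" with one extra hop to "e".
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("d", "e")]
adjacency = build_adjacency(edges)

print(dict(degree_distribution(adjacency)))       # {degree: node count}
print(dict(path_length_distribution(adjacency)))  # {path length: pair count}
```

Running one breadth-first search per node is fine for small examples like this one. On large networks it becomes expensive, so sampling source nodes is a common shortcut.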

We finished up this class by looking at APIs for accessing data from web services. Specifically, we used the New York Times Developer API to search for and download published articles. Their API console is a particularly friendly way to discover the API's capabilities without needing to immediately write code. We also briefly looked at YQL, Yahoo’s language for interacting with a large number of APIs in a standardized, SQL-like format. See Zapier’s short course for more details about APIs.
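
For a sense of what an article search looks like in code, here is a hedged Python sketch using only the standard library. It is not the code from class. The endpoint URL, the `q`, `page`, and `api-key` parameters, and the `response.docs[].headline.main` and `web_url` fields are assumptions based on the publicly documented Article Search API, so check the current developer documentation before relying on them. The key `YOUR_API_KEY` and the sample payload are placeholders.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint for the NYT Article Search API; verify against the current docs.
BASE_URL = "https://api.nytimes.com/svc/search/v2/articlesearch.json"


def build_search_url(query, api_key, page=0):
    """Build the request URL; parameter names are assumed from the public docs."""
    params = {"q": query, "page": page, "api-key": api_key}
    return BASE_URL + "?" + urlencode(params)


def extract_headlines(payload):
    """Pull (headline, url) pairs out of a decoded JSON response.

    The field names here are assumptions about the response layout.
    """
    docs = payload.get("response", {}).get("docs", [])
    return [(doc.get("headline", {}).get("main"), doc.get("web_url")) for doc in docs]


def search_articles(query, api_key, page=0):
    """Fetch one page of results over HTTP (needs a real key and network access)."""
    with urlopen(build_search_url(query, api_key, page)) as response:
        return extract_headlines(json.load(response))


# Offline demonstration with a made-up payload shaped like the assumed response.
sample_payload = {
    "response": {
        "docs": [
            {"headline": {"main": "Example headline"}, "web_url": "https://example.com/a"},
        ]
    }
}
print(build_search_url("network science", "YOUR_API_KEY"))
print(extract_headlines(sample_payload))
```

Keeping URL construction and response parsing separate from the network call means both can be checked without an API key. Calling `search_articles("network science", key)` with a real key would then fetch the first page of results.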