We spent these two lectures discussing network data, including a whirlwhind tour of the history of network theory, representations and characteristics of networks, and algorithms for analyzing network data.

People have studied theoretical problems on and properties of graphs for a long time, but only in the last few decades have we had access to real network data, such as online social networks or the topology of the Internet. When these data became available, it quickly became clear that real networks looked quite different than well-studied theoretical models (e.g., Erdős–Rényi random graphs). For example, many real networks have highly skewed degree distributions, reflecting the fact that most people in a social network have few friends while only a few people have many friends. At the same time, social networks typically have short path lengths, in the sense that one needs only to traverse a handful of links to connect a randomly selected set of people in the network.

After discussing many different types of networks that we might analyze as well as the various levels of abstraction available for representing them, we turned to algorithms for efficiently computing shortest path lengths, connected components, mutual friends, and clustering coefficients.

We started with the problem of finding the shortest distance between a single source node and all other nodes in a (undirected, unweighted) network, as measured by the fewest number of edges you need to traverse to get from the source to every other node. (Every researcher’s favorite version of this is computing their Erdős number, the academic take on the more well-known Kevin Bacon game. Compute yours here.)

Breadth first search (BFS) provides a nice solution. The intuition behind BFS is simple: we start from the source node and mark it as distance zero from itself. Then we visit each of its neighbors and mark those as distance one. We repeat this iteratively, pushing forward a boundary of recently discovered nodes that are one additional hop from the source at each step. BFS visits each node and edge in a network once, scaling linearly in the size of the network. If, however, we would like to find the shortest distance between all pairs of nodes then we must repeat this for each possible source node, and so this quickly becomes prohibitively expensive for even moderately sized networks. (See here for fancier, more efficient algorithms.)

Next we looked at using BFS for a related problem: finding the number of connected components, or separate pieces, of a network. We did this by simply looping over our shortest path code, seeding it on each iteration with a currently unreachable node as the source until we reach all nodes. We gave the reachable nodes in each BFS a unique label corresponding to its component.

Then we moved on to computing the number of friends that any two nodes have in common, motivated by the problem of friend recommendations on social networks. The underlying idea can be traced back to Granovetter: two people are likely to know each other if they have many mutual friends. To compute the number of mutual friends between all pairs of nodes, we exploit the fact that the neighbors of every node share that node as a common friend. To count all mutual friends we simply loop over each node and increment a counter for every pair of its neighbors. For each node this scales as the square of its degree, so the whole algorithm scales as the sum of the squared degrees of all nodes. This can quickly become expensive if we have even a few high-degree nodes, which are quite common in practice.

Finally, we looked at the closely related problem of counting the number of triangles around each node in a network. This algorithm is nearly identical to computing mutual friends, as we generate the same set of two-hop paths through all pairs of a node’s neighbors, but simply increment different counters to generate different results. Instead of accumulating mutual friends for each pair of a node’s neighbors, we ask whether every pair of neighbors are themselves directly connected. If so, we count this as (half of) a triangle in which the node participates. Dividing the number of closed triangles in a network by the number of possible triangles that could be present gives a useful for how clustered a network is.

To better understand properties of networks and how to compute them, we looked at a few example networks in R using the igraph package. See the notebooks on the course GitHub page for related code and data used in the lectures.