We spent this lecture discussing representations and characteristics of networks and algorithms for analyzing network data.

After discussing many different types of networks that we might analyze as well as the various levels of abstraction available for representing them, we turned to algorithms for efficiently computing shortest path lengths, connected components, mutual friends, and clustering coefficients.

We started with the problem of finding the shortest distance between a single source node and all other nodes in an (undirected, unweighted) network, measured by the minimum number of edges you must traverse to get from the source to each other node. (Every researcher’s favorite version of this is computing their Erdős number, the academic take on the better-known Kevin Bacon game. Compute yours here.)

Breadth-first search (BFS) provides a nice solution. The intuition behind BFS is simple: we start from the source node and mark it as distance zero from itself. Then we visit each of its neighbors and mark them as distance one. We repeat this iteratively, pushing forward a frontier of recently discovered nodes that are one additional hop from the source at each step. BFS visits each node and edge in the network once, so it scales linearly in the size of the network. If, however, we would like to find the shortest distance between all pairs of nodes, then we must repeat this for each possible source node, which quickly becomes prohibitively expensive for even moderately sized networks. (See here for fancier, more efficient algorithms.)
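The BFS idea above can be sketched in a few lines. (The lecture examples used R and igraph; this is a plain-Python illustration over a hypothetical adjacency-list dictionary.)

```python
from collections import deque

def bfs_distances(graph, source):
    """Hop distances from source to every reachable node.

    graph: dict mapping each node to a list of its neighbors
    (an adjacency list for an undirected, unweighted network).
    """
    distances = {source: 0}
    frontier = deque([source])
    while frontier:
        node = frontier.popleft()
        for neighbor in graph[node]:
            if neighbor not in distances:  # first visit gives the shortest distance
                distances[neighbor] = distances[node] + 1
                frontier.append(neighbor)
    return distances

# Example: a small branching network
graph = {"a": ["b"], "b": ["a", "c", "d"], "c": ["b"], "d": ["b", "e"], "e": ["d"]}
print(bfs_distances(graph, "a"))  # {'a': 0, 'b': 1, 'c': 2, 'd': 2, 'e': 3}
```

Because each node is enqueued at most once and each edge is examined a constant number of times, the running time is linear in the size of the network, as described above.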

Next we looked at using BFS for a related problem: finding the number of connected components, or separate pieces, of a network. We did this by simply looping over our shortest path code, seeding each iteration with a currently unreachable node as the source until all nodes have been reached. We gave the reachable nodes in each BFS a unique label identifying their component.
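A minimal sketch of this loop-over-BFS approach, again in plain Python over a hypothetical adjacency-list dictionary:

```python
from collections import deque

def connected_components(graph):
    """Label each node with a component id by repeated BFS.

    graph: dict mapping each node to a list of its neighbors (undirected).
    Returns a dict from node to component label (0, 1, 2, ...).
    """
    label = {}
    current = 0
    for source in graph:
        if source in label:
            continue  # already reached by an earlier BFS
        label[source] = current
        frontier = deque([source])
        while frontier:
            node = frontier.popleft()
            for neighbor in graph[node]:
                if neighbor not in label:
                    label[neighbor] = current
                    frontier.append(neighbor)
        current += 1
    return label

# Two components: {a, b, c} and {d, e}
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"], "d": ["e"], "e": ["d"]}
print(connected_components(graph))  # {'a': 0, 'b': 0, 'c': 0, 'd': 1, 'e': 1}
```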

Then we moved on to computing the number of friends that any two nodes have in common, motivated by the problem of friend recommendations on social networks. The underlying idea can be traced back to Granovetter: two people are likely to know each other if they have many mutual friends. To compute the number of mutual friends between all pairs of nodes, we exploit the fact that the neighbors of every node share that node as a common friend. To count all mutual friends we simply loop over each node and increment a counter for every pair of its neighbors. For each node this scales as the square of its degree, so the whole algorithm scales as the sum of the squared degrees of all nodes. This can quickly become expensive if we have even a few high-degree nodes, which are quite common in practice.
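The counting scheme above can be sketched as follows; the graph here is a made-up example, not one from lecture.

```python
from collections import defaultdict
from itertools import combinations

def mutual_friend_counts(graph):
    """Count mutual friends for every pair of nodes.

    Each node is a common friend of every pair of its neighbors, so we
    loop over nodes and increment a counter for each pair of neighbors.
    Total work is the sum of squared degrees over all nodes.
    """
    counts = defaultdict(int)
    for node, neighbors in graph.items():
        for u, w in combinations(sorted(neighbors), 2):
            counts[(u, w)] += 1  # node is a mutual friend of u and w
    return dict(counts)

graph = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["c"],
}
# a and b share c as a mutual friend; a and d share c; etc.
print(mutual_friend_counts(graph))
```

Note how the inner loop over pairs of neighbors is what makes high-degree nodes expensive: a node of degree d contributes on the order of d² pairs.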

Finally, we looked at the closely related problem of counting the number of triangles around each node in a network. This algorithm is nearly identical to computing mutual friends: we generate the same set of two-hop paths through all pairs of a node’s neighbors, but increment different counters to produce different results. Instead of accumulating mutual friends for each pair of a node’s neighbors, we ask whether each pair of neighbors is itself directly connected. If so, we count this as (half of) a triangle in which the node participates. Dividing the number of closed triangles in a network by the number of possible triangles that could be present gives a useful measure of how clustered a network is.
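A sketch of the triangle count and the resulting clustering measure, under the common convention that divides closed two-hop paths ("wedges") by all wedges (this matches igraph's transitivity, though the exact normalization is an assumption here):

```python
from itertools import combinations

def triangle_counts(graph):
    """Count triangles through each node.

    Same two-hop loop as for mutual friends, but instead of accumulating
    a counter for each neighbor pair, we check whether the pair is itself
    an edge. Each triangle is seen once from each of its three corners.
    """
    neighbor_sets = {node: set(nbrs) for node, nbrs in graph.items()}
    triangles = {node: 0 for node in graph}
    for node, nbrs in graph.items():
        for u, w in combinations(nbrs, 2):
            if w in neighbor_sets[u]:  # neighbors u and w are connected
                triangles[node] += 1
    return triangles

def global_clustering(graph):
    """Fraction of connected triples (wedges) that are closed."""
    closed = sum(triangle_counts(graph).values())  # 3 x number of triangles
    degrees = (len(nbrs) for nbrs in graph.values())
    wedges = sum(d * (d - 1) // 2 for d in degrees)
    return closed / wedges if wedges else 0.0

# A triangle a-b-c with a pendant node d attached to c
graph = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}
print(triangle_counts(graph))    # {'a': 1, 'b': 1, 'c': 1, 'd': 0}
print(global_clustering(graph))  # 3 closed wedges out of 5 -> 0.6
```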

To better understand properties of networks and how to compute them, we looked at a few example networks in R using the igraph package. See the notebooks on the course GitHub page for related code and data used in the lectures.

We finished lecture with a preview of issues related to causal inference.