In the first part of lecture we focused on adapting the network algorithms discussed in last week’s class to the MapReduce framework, to handle larger datasets. When working with a small- to medium-sized network that fits in memory on a single machine, it’s easy to take for granted that you have instant, random access to the entire network. Parallel frameworks such as MapReduce allow you to work with larger networks, but they don’t share this feature, which complicates things. This discussion was based on slides and code from a 2010 tutorial, which gives details for computing degree distributions, clustering coefficients, and shortest paths in MapReduce.
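To fix ideas, here is a minimal sketch of how the simplest of these computations, node degrees, maps onto the framework, simulating the map, shuffle, and reduce phases in plain Python. The toy edge list and function names are illustrative, not from the tutorial.

```python
from collections import defaultdict

# Toy undirected edge list (illustrative).
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]

def map_degree(edge):
    # Map phase: emit (node, 1) for each endpoint of an edge.
    u, v = edge
    yield (u, 1)
    yield (v, 1)

def shuffle(pairs):
    # Shuffle: group values by key, as the framework does between phases.
    grouped = defaultdict(list)
    for k, v in pairs:
        grouped[k].append(v)
    return grouped

def reduce_degree(node, counts):
    # Reduce phase: sum the counts to get the node's degree.
    return (node, sum(counts))

degrees = dict(
    reduce_degree(k, vs)
    for k, vs in shuffle(p for e in edges for p in map_degree(e)).items()
)
# A second map/reduce round over `degrees` would then count how many
# nodes have each degree, yielding the degree distribution itself.
```

Each phase touches only one record (or one key group) at a time, which is exactly what lets the framework distribute the work, and exactly why there is no random access to the rest of the network.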
Take, for instance, breadth-first search (BFS). On a single machine we simply maintain a boundary of discovered nodes and increment the distance to any undiscovered neighbors of the boundary. In MapReduce, the boundary is likely to be distributed across many different machines, and its neighbors are themselves probably on entirely different machines. As a result, we have to do a lot of redundant work to implement BFS in MapReduce. Intuitively, the algorithm expands the boundary one step for every completed round of MapReduce. In each round, every discovered node sends each of its neighbors a candidate distance (its own distance plus one) in the map phase, and each undiscovered node takes the minimum of all the distance updates it receives in the reduce phase. This lets us scale BFS to networks that are too large to fit in memory, but it can require many rounds of MapReduce for networks with large diameters (e.g., those with long chains), which can be expensive in practice.
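A minimal sketch of one such round, again simulating the phases in plain Python (the toy graph, the `bfs_round` name, and the message encoding are all illustrative assumptions, not the tutorial's code):

```python
from collections import defaultdict

INF = float("inf")

# Node state: (distance, adjacency list). Distances start at infinity
# except for the source node "s". (Toy graph, for illustration only.)
graph = {
    "s": (0, ["a", "b"]),
    "a": (INF, ["s", "c"]),
    "b": (INF, ["s"]),
    "c": (INF, ["a"]),
}

def bfs_round(graph):
    # Map phase: every node re-emits its own state, and each discovered
    # node sends a candidate distance (its distance + 1) to its neighbors.
    messages = defaultdict(list)
    for node, (dist, nbrs) in graph.items():
        messages[node].append((dist, nbrs))
        if dist < INF:
            for nbr in nbrs:
                messages[nbr].append((dist + 1, None))
    # Reduce phase: each node keeps the minimum distance it received,
    # along with its (unchanged) adjacency list.
    new_graph = {}
    for node, msgs in messages.items():
        dist = min(d for d, _ in msgs)
        nbrs = next(n for _, n in msgs if n is not None)
        new_graph[node] = (dist, nbrs)
    return new_graph

# One full MapReduce job per round: the boundary advances one hop each time.
for _ in range(3):
    graph = bfs_round(graph)
```

Note that the entire graph, adjacency lists included, flows through the shuffle every round just to move the boundary one hop; a graph with diameter 20 needs 20 full jobs.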
Similar challenges arise in porting node-level clustering coefficient or mutual-friend calculations from a single machine to MapReduce. Here the issue is not running over many rounds, but generating a large amount of intermediate data in the shuffle during a single round. Both of these algorithms rely on generating two-hop paths from adjacency lists. In MapReduce, this means that every node must effectively tell each of its neighbors about all of its other neighbors, creating a quadratic increase in the intermediate data handled by the shuffle. This dramatically slows down the shuffle and introduces what has been termed the Curse of the Last Reducer, where most tasks execute quickly but the few unlucky machines that handle high-degree nodes block progress of the entire computation. (This paper contains several clever improvements on the simple two-hop path generation approach.)
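The quadratic blow-up is easy to see in a sketch of the two-hop path generation (toy adjacency lists and function names are illustrative assumptions):

```python
from collections import defaultdict
from itertools import combinations

# Toy adjacency lists (illustrative).
adj = {
    "a": ["b", "c", "d"],
    "b": ["a", "c"],
    "c": ["a", "b"],
    "d": ["a"],
}

def map_two_hop(node, nbrs):
    # Map phase: each node tells every pair of its neighbors about each
    # other. This is the quadratic blow-up: a node of degree d emits on
    # the order of d^2 records into the shuffle.
    for u, v in combinations(sorted(nbrs), 2):
        yield ((u, v), node)

def reduce_two_hop(pair, middles):
    # Reduce phase: the pair (u, v) has len(middles) two-hop paths
    # between them; joining this against the edge list gives mutual
    # friend counts and closed triangles.
    return (pair, len(middles))

shuffled = defaultdict(list)
for node, nbrs in adj.items():
    for k, v in map_two_hop(node, nbrs):
        shuffled[k].append(v)

two_hop_counts = dict(reduce_two_hop(k, vs) for k, vs in shuffled.items())
```

A celebrity node with a million followers would emit on the order of a trillion records from the map phase alone, which is why the reducers that receive its keys become the last, cursed stragglers.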
All of this serves to point out that while many computations are possible in MapReduce, not all are advisable. Simply put, in some cases MapReduce just isn’t the right framework, and for parallel network computations this can often be the case. There is theoretical work that provides a Model of MapReduce to formalize these ideas. There are also alternative frameworks for parallel network algorithms, such as Google’s Pregel and Apache’s open source clone, Giraph. Another quite effective option is simply to use a high-memory machine for many of these tasks.
In the second half of lecture, we looked at a few prediction problems on networks for which you might use features generated by these network algorithms. These include predicting demographics and individual behavior by exploiting homophily—or the tendency for people to form ties to others who are similar to them—in social networks.
In preparation for discussing experiments next week, we concluded with a brief overview of statistical inference. We reviewed point estimates, confidence intervals, and hypothesis testing, all through simulations. See the R Markdown notebook for simulation-based approaches to statistical inference, also hosted on the course GitHub page, as well as Yakir’s excellent (and free) online Introduction to Statistical Thinking (With R, Without Calculus).
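To give a flavor of the simulation-based approach: a bootstrap confidence interval replaces the textbook formula with resampling. The sketch below is a hypothetical Python analogue on made-up data, not the course notebook (which is in R); the `bootstrap_ci` name and the sample itself are assumptions for illustration.

```python
import random
import statistics

random.seed(0)

# Made-up sample (illustrative only): 200 draws from a normal distribution.
sample = [random.gauss(5, 2) for _ in range(200)]

def bootstrap_ci(data, n_boot=2000, alpha=0.05):
    # Simulation-based confidence interval: resample with replacement,
    # recompute the point estimate each time, and take percentiles of
    # the resulting distribution of estimates.
    means = sorted(
        statistics.mean(random.choices(data, k=len(data)))
        for _ in range(n_boot)
    )
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot)]
    return lo, hi

lo, hi = bootstrap_ci(sample)
```

The same resampling idea extends to hypothesis testing (permutation tests) with only a few lines changed, which is what makes simulation such a unifying way to teach inference.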