Last week we saw that many social science questions can be answered by simply counting relevant quantities in the data. For example, looking at the median rank of movies viewed by an individual allowed us to compute the person’s eccentricity, a value that we then used to gauge the impact of larger catalogs on people’s happiness.
However, as datasets grow in size, even simple counting becomes a time-consuming task on a single computer. Moreover, CPUs have largely stopped getting faster; instead, modern processors have multiple cores, which can do work simultaneously. The parallelism doesn’t stop there: modern data centers have hundreds of machines, each with multiple CPUs, each with multiple cores. An obvious question arises: how can we distribute counting tasks across machines and cores to take advantage of this massive computational power?
A key principle in distribution is to split up the computation so as to minimize communication between different machines. For example, suppose we have a dataset where every line represents a phone call, showing the caller, time of call, duration of call and the number dialed, and we want to identify the person who receives the most calls. If we partition the data by the callee, then we can guarantee that all calls to a particular person end up on the same machine, making it easy for each machine to calculate the most popular callee from those assigned to it. A final comparison of these per-machine winners then yields the overall answer. If, on the other hand, we partition by the caller, we would still need to aggregate the data across all of the machines to find out how many times an individual was dialed.
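The partition-by-callee idea can be sketched on a single computer. This is only an illustration under assumptions: the record layout, the sample data, and the helper names (`partition_by_callee`, `local_most_popular`, `most_popular_callee`) are made up here, and plain Python lists stand in for the machines.

```python
from collections import Counter
import zlib

# Hypothetical call records: (caller, time_of_call, duration_seconds, callee).
CALLS = [
    ("alice", "09:00", 120, "bob"),
    ("carol", "09:05", 30, "bob"),
    ("bob", "09:10", 300, "alice"),
    ("dave", "09:15", 45, "bob"),
    ("alice", "09:20", 60, "carol"),
]


def partition_by_callee(calls, num_machines):
    """Assign each call to a 'machine' using a stable hash of the callee."""
    machines = [[] for _ in range(num_machines)]
    for call in calls:
        callee = call[3]
        # crc32 is used because Python's built-in hash() of strings
        # changes from one process to the next.
        index = zlib.crc32(callee.encode("utf-8")) % num_machines
        machines[index].append(call)
    return machines


def local_most_popular(calls):
    """Return (callee, count) for the most-called person on one machine."""
    counts = Counter(call[3] for call in calls)
    return counts.most_common(1)[0] if counts else None


def most_popular_callee(calls, num_machines=3):
    machines = partition_by_callee(calls, num_machines)
    # Each machine works independently; only one small (callee, count)
    # pair per machine is needed for the final comparison.
    winners = [w for w in map(local_most_popular, machines) if w is not None]
    return max(winners, key=lambda pair: pair[1])


print(most_popular_callee(CALLS))  # ('bob', 3)
```

Because every call to a given person lands in the same list, the per-machine counts are already complete, and the answer does not depend on how many machines we pretend to have.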
This simple example highlights that, unfortunately, there is no perfect split of the data. For example, if we wanted to instead find the person who called the most people, we would partition by the caller, not the callee; if we wanted to find the person who spent the most time on the phone, we would aggregate all phone calls to and from the same person on the same machine. A key realization here is that while the specific aggregation function differs in each case, all of these problems can be handled by one underlying infrastructure. MapReduce, first introduced by Dean and Ghemawat, is one such infrastructure that decomposes such tasks into two simple functions: “map”, which specifies how the data are to be partitioned, and “reduce”, which governs what happens on every partition.
Specifically, the MapReduce system (and its open source implementation, Hadoop) treats all data as (key, value) pairs, and the programmer controls how they are processed by writing map and reduce functions. In the above example, the input key might be the timestamp, with the value encapsulating the caller, callee and duration of call. In the map step we define how we want the data partitioned by producing a key for each row. The MapReduce system then performs a distributed group-by, guaranteeing that all elements with the same key end up on the same machine. Thus if we want to aggregate by the caller, we set the key to be the caller_id; if we would rather aggregate by the callee, we set the key to be the id of the recipient.
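To make the map step and the group-by concrete, here is a small local sketch. It is not the Hadoop API: the record layout, the sample data, and the names `map_by_caller`, `map_by_callee`, `run_map` and `shuffle` are assumptions chosen for illustration, and an in-memory dictionary stands in for the distributed group-by.

```python
from collections import defaultdict

# Hypothetical input pairs: (timestamp, (caller, callee, duration_seconds)).
RECORDS = [
    ("09:00", ("alice", "bob", 120)),
    ("09:05", ("carol", "bob", 30)),
    ("09:10", ("bob", "alice", 300)),
    ("09:15", ("alice", "bob", 45)),
    ("09:20", ("alice", "carol", 60)),
]


def map_by_caller(timestamp, call):
    """Re-key a call by its caller, so calls are grouped per caller."""
    caller, callee, duration = call
    yield caller, call


def map_by_callee(timestamp, call):
    """Re-key a call by its callee, so calls are grouped per recipient."""
    caller, callee, duration = call
    yield callee, call


def run_map(records, map_fn):
    """Apply the map function to every input (key, value) pair."""
    for key, value in records:
        yield from map_fn(key, value)


def shuffle(mapped_pairs):
    """Local stand-in for the group-by: collect all values sharing a key."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return dict(groups)


by_callee = shuffle(run_map(RECORDS, map_by_callee))
by_caller = shuffle(run_map(RECORDS, map_by_caller))
print(sorted(by_callee))       # ['alice', 'bob', 'carol']
print(len(by_callee["bob"]))   # 3
```

Switching from one aggregation to the other only means swapping the map function; the grouping code stays the same, which is the point of the abstraction.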
In the reduce step, the programmer specifies what to do with the list of values associated with each key. If we are looking for the most popular callee, we count the calls this person received (or the number of unique callers, if that is the notion of popularity we want). If we are interested in the person with the largest phone bill, we sum the durations of all phone calls made by the same caller, and so on.
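The reduce functions for these cases might look like the sketch below. The grouped inputs are hypothetical stand-ins for what the group-by step would hand to each reducer, and the function names (`reduce_count_calls`, `reduce_unique_callers`, `reduce_total_duration`, `run_reduce`) are assumptions for illustration rather than part of any MapReduce library.

```python
# Hypothetical output of the group-by step:
# key -> list of (caller, callee, duration_seconds) values.
GROUPED_BY_CALLEE = {
    "bob": [("alice", "bob", 120), ("carol", "bob", 30), ("alice", "bob", 45)],
    "alice": [("bob", "alice", 300)],
    "carol": [("alice", "carol", 60)],
}
GROUPED_BY_CALLER = {
    "alice": [("alice", "bob", 120), ("alice", "bob", 45), ("alice", "carol", 60)],
    "carol": [("carol", "bob", 30)],
    "bob": [("bob", "alice", 300)],
}


def reduce_count_calls(callee, calls):
    """Number of calls received by this callee."""
    return callee, len(calls)


def reduce_unique_callers(callee, calls):
    """Number of distinct people who called this callee."""
    return callee, len({caller for caller, _, _ in calls})


def reduce_total_duration(caller, calls):
    """Total seconds of calls made by this caller (a proxy for the bill)."""
    return caller, sum(duration for _, _, duration in calls)


def run_reduce(groups, reduce_fn):
    """Apply the reduce function independently to every key's value list."""
    return dict(reduce_fn(key, values) for key, values in groups.items())


call_counts = run_reduce(GROUPED_BY_CALLEE, reduce_count_calls)
unique_callers = run_reduce(GROUPED_BY_CALLEE, reduce_unique_callers)
total_seconds = run_reduce(GROUPED_BY_CALLER, reduce_total_duration)

print(max(call_counts, key=call_counts.get))      # bob
print(unique_callers["bob"])                      # 2
print(max(total_seconds, key=total_seconds.get))  # bob
```

Each reducer sees only the values for one key, so the calls to `reduce_fn` are independent of one another and could run on different machines.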
This simple divide-and-conquer abstraction, telling the system first how to partition the data (map) and then what to do on each partition (reduce), is immensely powerful. It scales to thousands of machines and lets us compute efficiently on multi-terabyte inputs. We will explore the full power of this paradigm in the coming weeks.
- Python scripts for local MapReduce and simple wordcount examples are on the course GitHub page.