Lecture 4: Counting at Scale, Part II

The simple divide-and-conquer abstraction of MapReduce that we saw in the previous lecture is immensely powerful for large-scale data analysis. Many of the common statistics we are interested in computing fit this formulation.
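As a reminder of how the abstraction works, here is a minimal in-memory sketch in Python; the map_reduce helper and the word-count mapper and reducer below are illustrative, not part of any real framework:

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    # Apply the mapper to every record, group the emitted (key, value)
    # pairs by key (the "shuffle"), then apply the reducer to each group.
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}

# The canonical word count: the mapper emits (word, 1) for every word,
# and the reducer sums the counts within each group.
def wc_mapper(line):
    for word in line.split():
        yield word, 1

def wc_reducer(word, counts):
    return sum(counts)

print(map_reduce(["to be or not to be"], wc_mapper, wc_reducer))
# {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```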

For example, to find the maximum of a set of numbers, we can partition the dataset into multiple groups, find the maximum of each group, and return the maximum of the group maxima. To find the average, we compute the sum and the number of elements in each group, then divide the total of the group sums by the total of the group counts. Not all functions can be computed this way; computing the median, for example, is much harder, as we cannot simply look at the median of each group. In fact, no deterministic algorithm can find the exact median in one pass through the data without keeping all of the data in memory. Instead, this requires a multi-pass approach: first finding the approximate range where the median lies, then iterating through the data again to pin down the exact median.
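To make the contrast concrete, here is a small Python sketch with made-up numbers. The maximum composes directly, while the average must carry a (sum, count) pair per group, since averaging the per-group averages is wrong when the groups differ in size:

```python
groups = [[1, 2, 3], [10, 20, 30, 40, 50, 60]]

# Maximum: the maximum of the group maxima is the overall maximum.
overall_max = max(max(g) for g in groups)      # 60

# Average: keep a (sum, count) pair per group and divide once at the end.
partials = [(sum(g), len(g)) for g in groups]  # [(6, 3), (210, 6)]
total_sum = sum(s for s, _ in partials)        # 216
total_count = sum(c for _, c in partials)      # 9
overall_avg = total_sum / total_count          # 24.0

# Averaging the group averages instead gives (2.0 + 35.0) / 2 = 18.5,
# which is wrong because the second group has twice as many elements.
```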

While computing summary statistics is relatively easy once the data is in the right form, getting it into that form usually requires a fair amount of work. For instance, we often have multiple input sources, each describing a different aspect of the data: one file with the ratings each person gave to each movie, a separate file with movie details (year of release, director, etc.), and yet another with user information (demographics, reported location, etc.). Thus, if we want to find the movie from 1985 most watched by people in New York, we must first combine all of these datasets before issuing a simple maximum query.

This kind of combination is formally called a join. There are many kinds of joins (left, right, inner, outer, etc.), each of which performs the combination in a particular manner. For example, given one table mapping each user rating to a movie and another mapping each movie to its year of release, an inner join on the movie name produces a three-column table of rating, movie name, and release year, but only for movies that have both a rating and a release year; any incomplete rows are dropped. A left join, on the other hand, keeps all entries of the first table (here, the ratings table), even for movies with no release year, but drops movies that have a release year and no ratings. Conversely, a right join keeps all entries of the second table, even for movies with no ratings. Finally, a full outer join keeps all entries from both tables, filling in missing values wherever possible.
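In the MapReduce model, such joins are often implemented as a reduce-side join: each mapper tags its records with the table they came from, the shuffle groups all records sharing the join key, and the reducer pairs them up. Here is a minimal in-memory sketch with toy data (the movie names and values are made up):

```python
from collections import defaultdict

ratings = [("Brazil", 5), ("Brazil", 3), ("Clue", 4)]  # (movie, rating)
years   = [("Brazil", 1985), ("Witness", 1985)]        # (movie, year)

# The "shuffle": group records from both tables by the join key.
groups = defaultdict(lambda: {"ratings": [], "years": []})
for movie, rating in ratings:
    groups[movie]["ratings"].append(rating)
for movie, year in years:
    groups[movie]["years"].append(year)

# Inner join: emit rows only for movies present in both tables.
inner = [(r, movie, y) for movie, g in groups.items()
         for r in g["ratings"] for y in g["years"]]
# [(5, 'Brazil', 1985), (3, 'Brazil', 1985)]; 'Clue' and 'Witness' dropped.

# Left join: keep every rating, filling in None for a missing year.
left = [(r, movie, y) for movie, g in groups.items()
        for r in g["ratings"] for y in (g["years"] or [None])]
# Adds (4, 'Clue', None); 'Witness' (a year but no ratings) is still dropped.
```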

Typical analyses require multiple such joins followed by the relevant calculations on the resulting collections. While one can write a standalone MapReduce program for each of these steps, the logistics of managing the flow of these jobs and all of the intermediate data quickly become non-trivial. As we will discuss in coming lectures, Pig, a data processing language built on top of MapReduce, simplifies the execution of these and other common MapReduce operations, including joins, filtering, sorting, and grouping.