Monthly Archives: February 2013

Homework 1

The first homework is posted.

The first problem looks at the impact of inventory size on customer satisfaction for the MovieLens data, the second is an exercise in simple streaming calculations, and the third explores various counting scenarios.

A script to download the data for the first question as well as a solution template for the second are available on the course GitHub page.

Lecture 4: Counting at Scale, Part II

The simple divide and conquer abstraction of MapReduce that we saw in the past lecture is immensely powerful for large scale data analysis. Many of the common statistics that we are interested in computing fall into this formulation.

For example, to find the maximum of a set of numbers, we can partition the dataset into multiple groups, find the maximum of each group and return the maximum of the group maxima. To find the average, we find the total sum associated with each group, as well as the number of elements in each, then combine these in an obvious manner. Not all functions can be computed this way—for example computing the median is much harder, as we cannot simply look at the median of each group. In fact, there is no deterministic algorithm that finds the median in one pass through the data without keeping all of the data in memory. Rather this requires a multi-pass approach, first finding the approximate range where the median lies, and then iterating through the data again to find the exact median.

While computing summary statistics is relatively easy once the data is in the right form, getting to this point usually requires a fair amount of work. For instance, we often have multiple input sources, each describing different aspects of the data—we may have one file describing the ratings that each person gave to each movie, a separate file with movie details (e.g., year of release, director, etc.), and yet another file containing user information (e.g., demographics, reported location, etc.). Thus if we want to find the movie from 1985 most watched by people in New York, we have to combine all of these datasets before issuing a simple maximum query.

This kind of combination is formally called a join. There are many possible kinds of joins—left, right, inner, outer, etc.—each of which performs the combination in a particular manner. For example, given a dataset mapping each user rating to a movie, and a movie to its year of release, an inner join on the movie name will produce a three column table—with a rating, movie name, and release year, but only for those movies that have both a rating and a release year; any incomplete rows in the resulting table will be dropped. On the other hand, a left join will keep all entries of the first table (here the movie ratings table) even for movies that have no release year, but will not produce movies with a release year but no ratings. Conversely, a right join will keep all entries of the second table, even for movies that have no ratings; finally, a full outer join will keep all entries and will fill in missing data whenever possible.

Typical analyses require multiple such joins followed by the relevant calculations on the resulting collections. While one can write a standalone map reduce program for each one of the steps, the logistics of managing the flow of these jobs and all of the intermediate data quickly becomes non-trivial. As we will discuss in coming lectures, Pig—a data processing language that is built on top of MapReduce—facilitates and simplifies the execution of these and other common MapReduce operations, including joins, filtering, sorting, and grouping.

Lecture 3: Counting at Scale, Part I

Last week we saw that many social science questions can be answered by simply counting relevant quantities in the data. For example looking at the median rank of movies viewed by an individual allowed us to compute the persons’s eccentricity, a value that we then used to gauge the impact of larger catalogs on people’s happiness.

However, as datasets grow in size, even simple counting becomes a time consuming task on a single computer. Moreover, CPUs have largely stopped getting faster; instead, all modern processors exhibit multiple cores, which can do work simultaneously, and the parallelism doesn’t stop there—modern data centers have hundreds of machines each with multiple CPUs, each with multiple cores. An obvious question arises: how can we distribute counting tasks across machines and cores to take advantage of this massive computational power?

A key principle in distribution is to split up the computation so as to minimize communication between different machines. For example, suppose we have a dataset where every line represents a phone call, showing the caller, time of call, duration of call and the number dialed, and we want to identify the person who is receiving the most number of calls. If we partition the data by the callee, then we can guarantee that all calls to a particular person end up on the same machine, making it easy for each machine to calculate the most popular callee from those assigned to it. If, on the other hand, we partition by the caller, we would still need to aggregate the data across all of the machines to find out how many times an individual was dialed.

This simple example highlights that, unfortunately, there is no perfect split of the data—for example, if we wanted to instead find the person who called the most people, we would partition by the caller, not the callee; if we wanted to find the person who spent the most time on the phone, we would aggregate all phone calls to and from the same person on the same machine. A key realization here is that while the specific aggregation function differs in each case, all of these problems can be handled by one underlying infrastructure. MapReduce, first introduced by Dean and Ghemawat, is one such infrastructure that decomposes such tasks into two simple functions: “map”, which specifies how the data are to be partitioned, and “reduce”, which governs what happens on every partition.

Specifically, the MapReduce system (and its open source implementation, Hadoop), treats all data as (key, value) pairs, which the programmer writes map and reduce functions to control. In the above example, the key may be the timestamp, and the value encapsulates the caller, callee and duration of call. In the map step we define how we want the data partitioned by producing a key for each row. The MapReduce system then performs a distributed group-by, guaranteeing that all elements with the same key end up on the same machine. Thus if we want to aggregate by the caller, we set the key to be the caller_id; if we would rather aggregate by the callee, we set the key to be the id of the recipient.

In the reduce step, the programmer specifies what to do with the list of values associated with each key. If we are looking for the most popular callee, we count the number of unique people who called this person. If we are interested in the person with the largest phone bill, we count the total duration of all phone calls made by the same caller, etc.

This simple divide-and-conquer abstraction—telling the system first how to partition the data (map) and then what to do on each partition (reduce)—is immensely powerful. It easily scales to efficiently utilize thousands of machines, and allows us to efficiently compute on multi-terabyte sized inputs. We will explore the full power of this paradigm in the coming weeks.


  • Python scripts for local MapReduce and simple wordcount examples are on the course GitHub page.

Lecture 2: An Introduction to Counting

This week marked the first of three lectures on counting. While this may seem like a lot of time to devote to a very simple concept, we’ll see that counting at scale is surprisingly more powerful—and unfortunately somewhat more difficult—than one might think.

Take, for instance, the questions we discussed around the long tail of consumption last week: How does interest in niche content vary across individuals, and what’s the impact of inventory size on customer satisfaction? At their core, these questions can be rephrased as simple counting problems where we first group the data (e.g., by user, by item, etc.) and then compute distributions and statistics of interest within each group.

For example, one approach to quantifying users’ eccentricity is to first rank movies by popularity and then compute the median rank of the movies rated by each individual. The first operation is a group-by-movie followed by a simple count within each group to measure movie popularity, while the second is a group-by-user followed by a median within each group to measure the popularity of a typical movie watched by each user. Mainstream users who tend to watch more popular content will score lower on this eccentricity measure, while more niche individuals will have larger eccentricity scores due to their consumption of tail content.

Such calculations are relatively easy for small datasets (e.g., 10M ratings from MovieLens), as various group-by and counting operations are quite fast once we’ve read all of the data from disk into main memory. The same analysis becomes a bit more challenging for larger datasets (e.g., Netflix) that don’t fit in memory, as random access from disk is about 1000x slower than equivalent lookups in memory. One solution is to appeal to much faster sequential disk access to stream through the data one observation at a time, storing and updating only the statistics of interest in memory.

For instance, to compute the average rating for each movie in the Netflix dataset, we can simply stream through the file and examine one rating at a time, updating a running mean for each movie as it appears. Doing so requires that we hold a small dictionary over all 20k movies in memory rather than the entire dataset of all 100 million ratings, reducing our memory footprint from gigabytes to megabytes.

This same approach scales to billions or trillions of observations so long as the number of distinct groups (e.g., movies) remains small, and can be extended to arbitrary descriptive statistics when we can afford to hold a dictionary of counts over distinct outcomes (e.g., 5 rating values) within each group. For more restrictive settings where we have a reasonable number of groups (e.g., users) but a potentially large number of distinct outcomes within each group (e.g., each movie’s popularity), we may lose the ability to store within-group distributions in memory or compute quantiles, but can still compute combinable statistics such as counts and means via incremental updates over each observation. Finally, when we have an extremely large number of groups (e.g., billions of videos on YouTube), even simple within-group statistics (e.g., number of views by video) become intractable for arbitrarily ordered input data.

These scenarios are summarized in the following table, where N is the total number of observations, G is the number of groups, V is the average number of distinct outcomes within groups, and the data are assumed to be in arbitrary order.

Memory Scenario Distributions Statistics Example
N Small dataset Yes General Arbitrary stats on MovieLens
V*G Small distributions Yes General Median rating by movie for Netflix
G Small # groups No Combinable Mean rating by movie for Netflix
V Small # outcomes No No Mean movie popularity by user for Netflix
1 Large # both No No Mean video popularity for YouTube

As we’ll see in the next lecture, the set of computable statistics improves substantially for pre-grouped input data, where all observations for a given group appear sequentially. As a result we’ll shift our attention to MapReduce, a system designed to efficiently solve the group-by problem for large-scale data.


  • The course GitHub page has code to download the MovieLens dataset and R scripts for the user eccentricity analysis, as well as code for simple streaming analysis of the Netflix data.
  • Here are a few R tutorials and shell tutorials if you’re new to these tools.
  • Linux and Mac OS X users should have pre-installed GNU tools including sed, awk, grep, cut, and uniq. Windows users can download Cygwin or MinGW to install these tools.
  • See these notes on streaming variance algorithms for an glimpse at some more interesting combinable statistics.

Lecture 1: Case Studies in Computational Social Science

This week’s lecture provided a high-level introduction to computational social science, an emerging field at the intersection of social science, statistics, and computer science that aims to use large-scale, individual-level data on who people are (demographics), what they do (behavioral & temporal data), and who they know (networks) to further our understanding of human behavior. In contrast to traditional approaches to social science (e.g., surveys, aggregate data analysis, and lab experiments), the large-scale nature of these data present unique computational and methodological challenges that will be discussed throughout the course.

We discussed a basic research loop, with three broad but important steps:

  1. Formulate the question.
  2. Find/collect data.
  3. Address methodological challenges.

While it may be tempting to think of this as a simple, linear process, in practice we often find ourselves iterating through the loop several times to address various complications that arise. For instance, we may need to refine the motivating question given methodological challenges or limitations of the available data, or develop novel methods to deal with computational issues.

We then discussed several different questions from a variety of domains:

  • Marketing: The long tail of consumption. What is the impact of inventory size on customer satisfication? How does interest in niche content vary across individuals?

  • Political science: The convention bounce. Do people actually switch which candidate they support? Are there overall population-level shifts in candidate support?

  • Demography: The digital divide. Does Internet access/usage affect health, education, and employment outcomes? How do Internet usage patterns vary across subpopulations?

  • Economics: Auction design. How do you optimally set auction parameters (e.g., reserve prices, “buy it now”, etc.)?

  • Communication theory: Information diffusion. How do ideas and products spread through society? What is the empirical structure of diffusion cascades?

The technical details of investigating these questions will be the subject of subsequent lectures, but curious readers can find more information in the following papers.