Lecture 2: An Introduction to Counting

This week marked the first of three lectures on counting. While this may seem like a lot of time to devote to a very simple concept, we’ll see that counting at scale is surprisingly more powerful—and unfortunately somewhat more difficult—than one might think.

Take, for instance, the questions we discussed around the long tail of consumption last week: How does interest in niche content vary across individuals, and what’s the impact of inventory size on customer satisfaction? At their core, these questions can be rephrased as simple counting problems where we first group the data (e.g., by user, by item, etc.) and then compute distributions and statistics of interest within each group.

For example, one approach to quantifying users’ eccentricity is to first rank movies by popularity and then compute the median rank of the movies rated by each individual. The first operation is a group-by-movie followed by a simple count within each group to measure movie popularity, while the second is a group-by-user followed by a median within each group to measure the popularity of a typical movie watched by each user. Mainstream users who tend to watch more popular content will score lower on this eccentricity measure, while more niche individuals will have larger eccentricity scores due to their consumption of tail content.
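As a rough sketch of this two-stage group-by (in Python with hypothetical toy data; the course's own analysis code is in R), where rank 1 is the most popular movie:

```python
from collections import Counter, defaultdict
from statistics import median

# Hypothetical toy ratings as (user, movie) pairs.
ratings = [("ann", "Titanic"), ("ann", "The Matrix"), ("bob", "Titanic"),
           ("bob", "Primer"), ("bob", "Pi"), ("cat", "Titanic")]

# Stage 1: group by movie and count to measure popularity,
# then rank movies from most (rank 1) to least popular.
counts = Counter(movie for _, movie in ratings)
rank = {m: r for r, (m, _) in enumerate(counts.most_common(), start=1)}

# Stage 2: group by user and take the median rank of the movies they rated.
by_user = defaultdict(list)
for user, movie in ratings:
    by_user[user].append(rank[movie])
eccentricity = {user: median(ranks) for user, ranks in by_user.items()}
```

Here the user who watched only the most popular movie gets the lowest eccentricity score, while the user who rated tail content scores higher.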

Such calculations are relatively easy for small datasets (e.g., 10M ratings from MovieLens), as various group-by and counting operations are quite fast once we’ve read all of the data from disk into main memory. The same analysis becomes a bit more challenging for larger datasets (e.g., Netflix) that don’t fit in memory, as random access from disk is about 1000x slower than equivalent lookups in memory. One solution is to appeal to much faster sequential disk access to stream through the data one observation at a time, storing and updating only the statistics of interest in memory.

For instance, to compute the average rating for each movie in the Netflix dataset, we can simply stream through the file and examine one rating at a time, updating a running mean for each movie as it appears. Doing so requires that we hold a small dictionary over all 20k movies in memory rather than the entire dataset of all 100 million ratings, reducing our memory footprint from gigabytes to megabytes.
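A minimal sketch of this streaming update, assuming hypothetical comma-separated user,movie,rating lines (the actual Netflix files are formatted differently); in practice `lines` would be a file object streamed from disk:

```python
# A small in-memory sample stands in for the rating stream.
lines = ["1,Titanic,5", "2,Titanic,4", "3,Primer,3", "4,Titanic,3"]

stats = {}  # movie -> (count, running mean); one small entry per movie
for line in lines:
    _, movie, rating = line.split(",")
    n, mean = stats.get(movie, (0, 0.0))
    n += 1
    mean += (float(rating) - mean) / n  # incremental running-mean update
    stats[movie] = (n, mean)
```

Only one `(count, mean)` pair per movie is ever held in memory, regardless of how many ratings stream by.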

This same approach scales to billions or trillions of observations so long as the number of distinct groups (e.g., movies) remains small, and can be extended to arbitrary descriptive statistics when we can afford to hold a dictionary of counts over distinct outcomes (e.g., 5 rating values) within each group. For more restrictive settings where we have a reasonable number of groups (e.g., users) but a potentially large number of distinct outcomes within each group (e.g., each movie’s popularity), we may lose the ability to store within-group distributions in memory or compute quantiles, but can still compute combinable statistics such as counts and means via incremental updates over each observation. Finally, when we have an extremely large number of groups (e.g., billions of videos on YouTube), even simple within-group statistics (e.g., number of views by video) become intractable for arbitrarily ordered input data.
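The reason counts and means survive these restrictions is that they are combinable: summaries computed over disjoint chunks of data can be merged without revisiting the raw observations, which is not true of medians or other quantiles. A small sketch of merging two hypothetical (count, mean) summaries:

```python
def merge(a, b):
    """Combine (count, mean) summaries of two disjoint chunks of data."""
    n = a[0] + b[0]
    mean = (a[0] * a[1] + b[0] * b[1]) / n  # count-weighted average
    return (n, mean)

chunk1 = (3, 4.0)  # e.g., 3 observations averaging 4.0
chunk2 = (1, 2.0)  # 1 observation with value 2.0
combined = merge(chunk1, chunk2)
```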

These scenarios are summarized in the following table, where N is the total number of observations, G is the number of groups, V is the average number of distinct outcomes within groups, and the data are assumed to be in arbitrary order.

Memory   Scenario              Distributions   Statistics   Example
N        Small dataset         Yes             General      Arbitrary stats on MovieLens
V*G      Small distributions   Yes             General      Median rating by movie for Netflix
G        Small # groups        No              Combinable   Mean rating by movie for Netflix
V        Small # outcomes      No              No           Mean movie popularity by user for Netflix
1        Large # both          No              No           Mean video popularity for YouTube

As we’ll see in the next lecture, the set of computable statistics improves substantially for pre-grouped input data, where all observations for a given group appear sequentially. As a result, we’ll shift our attention to MapReduce, a system designed to efficiently solve the group-by problem for large-scale data.


  • The course GitHub page has code to download the MovieLens dataset and R scripts for the user eccentricity analysis, as well as code for simple streaming analysis of the Netflix data.
  • Here are a few R tutorials and shell tutorials if you’re new to these tools.
  • Linux and Mac OS X users should have GNU tools such as sed, awk, grep, cut, and uniq pre-installed. Windows users can download Cygwin or MinGW to install these tools.
  • See these notes on streaming variance algorithms for a glimpse at some more interesting combinable statistics.
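One well-known streaming variance method of the sort covered in those notes is Welford's algorithm, which maintains a running mean and sum of squared deviations in a single pass. A minimal sketch:

```python
def streaming_variance(xs):
    """Welford's one-pass algorithm for the mean and sample variance."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in xs:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)  # note: uses the *updated* mean
    var = m2 / (n - 1) if n > 1 else 0.0
    return mean, var
```

Like the running mean, this requires only constant memory per group, and the (n, mean, m2) triples from separate chunks can themselves be combined, making variance another combinable statistic.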

Lecture 1: Case Studies in Computational Social Science

This week’s lecture provided a high-level introduction to computational social science, an emerging field at the intersection of social science, statistics, and computer science that aims to use large-scale, individual-level data on who people are (demographics), what they do (behavioral & temporal data), and who they know (networks) to further our understanding of human behavior. In contrast to traditional approaches to social science (e.g., surveys, aggregate data analysis, and lab experiments), the large-scale nature of these data presents unique computational and methodological challenges that will be discussed throughout the course.

We discussed a basic research loop, with three broad but important steps:

  1. Formulate the question.
  2. Find/collect data.
  3. Address methodological challenges.

While it may be tempting to think of this as a simple, linear process, in practice we often find ourselves iterating through the loop several times to address various complications that arise. For instance, we may need to refine the motivating question given methodological challenges or limitations of the available data, or develop novel methods to deal with computational issues.

We then discussed several different questions from a variety of domains:

  • Marketing: The long tail of consumption. What is the impact of inventory size on customer satisfaction? How does interest in niche content vary across individuals?

  • Political science: The convention bounce. Do people actually switch which candidate they support? Are there overall population-level shifts in candidate support?

  • Demography: The digital divide. Does Internet access/usage affect health, education, and employment outcomes? How do Internet usage patterns vary across subpopulations?

  • Economics: Auction design. How do you optimally set auction parameters (e.g., reserve prices, “buy it now”, etc.)?

  • Communication theory: Information diffusion. How do ideas and products spread through society? What is the empirical structure of diffusion cascades?

The technical details of investigating these questions will be the subject of subsequent lectures, but curious readers can find more information in the following papers.