This week marked the first of three lectures on counting. While this may seem like a lot of time to devote to a very simple concept, we’ll see that counting at scale is surprisingly more powerful—and unfortunately somewhat more difficult—than one might think.
Take, for instance, the questions we discussed around the long tail of consumption last week: How does interest in niche content vary across individuals, and what’s the impact of inventory size on customer satisfaction? At their core, these questions can be rephrased as simple counting problems where we first group the data (e.g., by user, by item, etc.) and then compute distributions and statistics of interest within each group.
For example, one approach to quantifying users’ eccentricity is to first rank movies by popularity and then compute the median rank of the movies rated by each individual. The first operation is a group-by-movie followed by a simple count within each group to measure movie popularity, while the second is a group-by-user followed by a median within each group to measure the popularity of a typical movie watched by each user. Mainstream users who tend to watch more popular content will score lower on this eccentricity measure, while more niche individuals will have larger eccentricity scores due to their consumption of tail content.
Such calculations are relatively easy for small datasets (e.g., 10M ratings from MovieLens), as various group-by and counting operations are quite fast once we’ve read all of the data from disk into main memory. The same analysis becomes a bit more challenging for larger datasets (e.g., Netflix) that don’t fit in memory, as random access from disk is about 1000x slower than equivalent lookups in memory. One solution is to appeal to much faster sequential disk access to stream through the data one observation at a time, storing and updating only the statistics of interest in memory.
For instance, to compute the average rating for each movie in the Netflix dataset, we can simply stream through the file and examine one rating at a time, updating a running mean for each movie as it appears. Doing so requires that we hold a small dictionary over all 20k movies in memory rather than the entire dataset of all 100 million ratings, reducing our memory footprint from gigabytes to megabytes.
This same approach scales to billions or trillions of observations so long as the number of distinct groups (e.g., movies) remains small, and can be extended to arbitrary descriptive statistics when we can afford to hold a dictionary of counts over distinct outcomes (e.g., 5 rating values) within each group. For more restrictive settings where we have a reasonable number of groups (e.g., users) but a potentially large number of distinct outcomes within each group (e.g., each movie’s popularity), we may lose the ability to store within-group distributions in memory or compute quantiles, but can still compute combinable statistics such as counts and means via incremental updates over each observation. Finally, when we have an extremely large number of groups (e.g., billions of videos on YouTube), even simple within-group statistics (e.g., number of views by video) become intractable for arbitrarily ordered input data.
These scenarios are summarized in the following table, where N is the total number of observations, G is the number of groups, V is the average number of distinct outcomes within groups, and the data are assumed to be in arbitrary order.
|N||Small dataset||Yes||General||Arbitrary stats on MovieLens|
|V*G||Small distributions||Yes||General||Median rating by movie for Netflix|
|G||Small # groups||No||Combinable||Mean rating by movie for Netflix|
|V||Small # outcomes||No||No||Mean movie popularity by user for Netflix|
|1||Large # both||No||No||Mean video popularity for YouTube|
As we’ll see in the next lecture, the set of computable statistics improves substantially for pre-grouped input data, where all observations for a given group appear sequentially. As a result we’ll shift our attention to MapReduce, a system designed to efficiently solve the group-by problem for large-scale data.
- The course GitHub page has code to download the MovieLens dataset and R scripts for the user eccentricity analysis, as well as code for simple streaming analysis of the Netflix data.
- Here are a few R tutorials and shell tutorials if you’re new to these tools.
- Linux and Mac OS X users should have pre-installed GNU tools including sed, awk, grep, cut, and uniq. Windows users can download Cygwin or MinGW to install these tools.
- See these notes on streaming variance algorithms for an glimpse at some more interesting combinable statistics.