Author Archives: Jake Hofman

Lectures 11 & 12: Regression

These lectures explored both the practice of and theory behind applied regression analysis. As stated by Cook & Weisberg, regression analysis aims “to understand, as far as possible with the available data, how the conditional distribution of the response varies across subpopulations determined by the possible values of the predictor(s).” This includes everything from simple descriptive statistics discussed in our earlier lectures to more sophisticated statistical models for situations in which we have a small number of observations relative to the number of subpopulations. As Berk explains in his constructive critique, the aim of such analyses may vary from providing a compact description of the data, to predicting future outcomes, to explaining associations between outcomes and predictors.

We discussed two case studies: Predicting consumer behavior with Web search and A Large-Scale Study of Browsing Behavior. The first explored the relative value of various data sources for predicting the performance of movies, music, and video games, using simple auto-regressive models to generate future sales predictions from current and past performance. The second study investigated variation in how different demographic groups use the Web, making use of linear models to smooth descriptive statistics over subpopulations and control for covariation between predictors (e.g., education and race).

In the following lecture we reviewed the formalism behind ordinary least squares regression, phrased as maximum likelihood inference under the assumption of additive Gaussian noise. We then derived the closed-form solution to the normal equations for small-scale problems and discussed alternative optimization methods, namely gradient descent and stochastic gradient descent, for larger-scale settings.

Ordinary least squares regression is, in principle, easily solved by inverting the normal equations:
$$ \hat{w} = (X^T X)^{-1} X^T y.$$
In practice, however, it often computationally expensive to do necessary the matrix inversion for models with many features, even with specialized numerical methods.

Gradient descent offers an alternative solution to the normal equations, replacing potentially expensive matrix inversion with an iterative method where we update parameters by moving in the direction of steepest increase of the likelihood landscape:
$$ \hat{w} \leftarrow \hat{w} + \eta X^T (y – X\hat{w}), $$
where eta is a tunable step size. Choosing eta too small leads to slow convergence, whereas too large a step size may result in undesirable oscillations about local optima. Intuitively, gradient descent updates each component of the weight vector by a sum of the corresponding feature values over all examples, where examples are weighted by the error between actual and predicted labels. Stochastic gradient descent provides an even more scalable alternative where an approximation to the gradient from a subsample of the data is used in place of the entire dataset.

See Chapter 3 of Bishop and Chapter 3 of Hastie for reference.

Lecture 13: Classification

In this lecture we discussed classification methods for predicting discrete-valued targets (e.g., spam classification or gender identification). We noted several potential issues in directly applying linear regression to classification problems and explored naive Bayes and logistic regression as alternatives.

We first reviewed Bayes’ rule for inverting conditional probabilities via a simple, but perhaps counterintuitive, medical diagnosis example and then adapted this to an (extremely naive) one-feature classifier. We improved upon this by considering naive Bayes—a simple linear method for classification in which we model each feature independently. While the independence assumption is almost definitely incorrect, naive Bayes turns out to work well in practice. In addition, naive Bayes is simple to train and predict with at scale. Unfortunately, however, it does fail to account for correlations amongst features.

Logistic regression addresses this issue by modeling the class-conditional probabilities directly, using a logistic function to transform predictions from a linear model to lie in the unit interval: $$ p(y=1|x, w) = {1 \over 1 + e^{-w \cdot x}} $$ While maximum likelihood inference for logistic regression does not permit a closed-form solution, gradient descent results in the following update equations, similar to linear regression: $$ \hat{w} \leftarrow \hat{w} + \eta X^T (y – p). $$ In smaller-scale settings one can improve on these updates by using second-order methods such as Newton-Raphson that leverage the local curvature of the likelihood landscape to determine the step size at each iteration. As with regression, some form of regularization is often useful for balancing the fit to training data with generalization error when one has a relatively large number of features.

References include Chapter 4 of Bishop, Chapter 4 of Hastie, Chapter 6 of Segaran, Horvitz, et. al., 1998, Lewis, 1998, Graham, 2002, and Metsis, et. al., 2006.

Lecture 9: Data Wrangling

This is a guest lecture and post by John Myles White.

In this lecture, we talked about methods for getting data. We ranked methods in terms of their ease of use. For example, the easiest method was getting a bulk download of an entire data set. We noted that there are several clearinghouses of data that link to many publicly available data sets, including data from Wikipedia, IMDB,, and others. When these bulk downloads are not available, we noted that many web sites (e.g., NYTimes, Twitter, Google, etc.) offer API access with which you can download chunks of data at a time and slowly accumulate a large body of data. When even this is not possible, we noted that one can scrape data from sites so long as the Terms of Service allow automated access to the site using tools such as BeautifulSoup or Nokogiri.

In the second half of the lecture, we discussed how to work with the data that one acquires from websites using any of the three methods above. This data is often structured in formats like JSON and XML that must be parsed by the user using formal parsing libraries available in many popular languages like Python. Sometimes the data is in an unstructured format in which we simply want to extract basic information like phone numbers: we described the use of regular expressions as a mechanism for extracting this information. We worked through an extended example of building a regular expression that would match phone numbers.

Homework 2

The second homework is posted.

The first problem is a simple word count exercise over the Wikipedia corpus, the second examines Wikipedia page popularity, and the third explores tie strength between co-authors.

See Amazon’s getting started videos the references from lectures 5 and 6 for more information on Pig, EC2, and Elastic Mapreduce.

Some tips:

  1. Use the template solution files to test and debug Pig scripts on your local machine.

  2. Create a bucket with a unique name (e.g., your UNI) using the S3 console:
    Step 1

  3. Upload your locally tested Pig script to S3:
    Step 2

  4. Create a Pig job flow in the Elastic Mapreduce console:
    Step 3

  5. Specify the path to your Pig script on S3, along with input and output paths:
    Step 4

  6. Select the number of instances (5 small instances should be sufficient):
    Step 5

  7. Specify a log path for debugging and a keypair if you’d like to log into the cluster while the job is running:
    Step 6

  8. To avoid an error in allocating heap space for Java when the job starts, select the “Memory Intensive Configuration” bootstrap script:
    Step 7

  9. Review job details and submit the job:
    Step 8

  10. Monitor the job status through the Elastic Mapreduce console or log into the machine with ssh (or Putty) and check the JobTracker with lynx:

    ssh -i /path/to/keypair.pem
    $ lynx http://localhost:9100

Homework 1

The first homework is posted.

The first problem looks at the impact of inventory size on customer satisfaction for the MovieLens data, the second is an exercise in simple streaming calculations, and the third explores various counting scenarios.

A script to download the data for the first question as well as a solution template for the second are available on the course GitHub page.

Lecture 4: Counting at Scale, Part II

The simple divide and conquer abstraction of MapReduce that we saw in the past lecture is immensely powerful for large scale data analysis. Many of the common statistics that we are interested in computing fall into this formulation.

For example, to find the maximum of a set of numbers, we can partition the dataset into multiple groups, find the maximum of each group and return the maximum of the group maxima. To find the average, we find the total sum associated with each group, as well as the number of elements in each, then combine these in an obvious manner. Not all functions can be computed this way—for example computing the median is much harder, as we cannot simply look at the median of each group. In fact, there is no deterministic algorithm that finds the median in one pass through the data without keeping all of the data in memory. Rather this requires a multi-pass approach, first finding the approximate range where the median lies, and then iterating through the data again to find the exact median.

While computing summary statistics is relatively easy once the data is in the right form, getting to this point usually requires a fair amount of work. For instance, we often have multiple input sources, each describing different aspects of the data—we may have one file describing the ratings that each person gave to each movie, a separate file with movie details (e.g., year of release, director, etc.), and yet another file containing user information (e.g., demographics, reported location, etc.). Thus if we want to find the movie from 1985 most watched by people in New York, we have to combine all of these datasets before issuing a simple maximum query.

This kind of combination is formally called a join. There are many possible kinds of joins—left, right, inner, outer, etc.—each of which performs the combination in a particular manner. For example, given a dataset mapping each user rating to a movie, and a movie to its year of release, an inner join on the movie name will produce a three column table—with a rating, movie name, and release year, but only for those movies that have both a rating and a release year; any incomplete rows in the resulting table will be dropped. On the other hand, a left join will keep all entries of the first table (here the movie ratings table) even for movies that have no release year, but will not produce movies with a release year but no ratings. Conversely, a right join will keep all entries of the second table, even for movies that have no ratings; finally, a full outer join will keep all entries and will fill in missing data whenever possible.

Typical analyses require multiple such joins followed by the relevant calculations on the resulting collections. While one can write a standalone map reduce program for each one of the steps, the logistics of managing the flow of these jobs and all of the intermediate data quickly becomes non-trivial. As we will discuss in coming lectures, Pig—a data processing language that is built on top of MapReduce—facilitates and simplifies the execution of these and other common MapReduce operations, including joins, filtering, sorting, and grouping.

Lecture 2: An Introduction to Counting

This week marked the first of three lectures on counting. While this may seem like a lot of time to devote to a very simple concept, we’ll see that counting at scale is surprisingly more powerful—and unfortunately somewhat more difficult—than one might think.

Take, for instance, the questions we discussed around the long tail of consumption last week: How does interest in niche content vary across individuals, and what’s the impact of inventory size on customer satisfaction? At their core, these questions can be rephrased as simple counting problems where we first group the data (e.g., by user, by item, etc.) and then compute distributions and statistics of interest within each group.

For example, one approach to quantifying users’ eccentricity is to first rank movies by popularity and then compute the median rank of the movies rated by each individual. The first operation is a group-by-movie followed by a simple count within each group to measure movie popularity, while the second is a group-by-user followed by a median within each group to measure the popularity of a typical movie watched by each user. Mainstream users who tend to watch more popular content will score lower on this eccentricity measure, while more niche individuals will have larger eccentricity scores due to their consumption of tail content.

Such calculations are relatively easy for small datasets (e.g., 10M ratings from MovieLens), as various group-by and counting operations are quite fast once we’ve read all of the data from disk into main memory. The same analysis becomes a bit more challenging for larger datasets (e.g., Netflix) that don’t fit in memory, as random access from disk is about 1000x slower than equivalent lookups in memory. One solution is to appeal to much faster sequential disk access to stream through the data one observation at a time, storing and updating only the statistics of interest in memory.

For instance, to compute the average rating for each movie in the Netflix dataset, we can simply stream through the file and examine one rating at a time, updating a running mean for each movie as it appears. Doing so requires that we hold a small dictionary over all 20k movies in memory rather than the entire dataset of all 100 million ratings, reducing our memory footprint from gigabytes to megabytes.

This same approach scales to billions or trillions of observations so long as the number of distinct groups (e.g., movies) remains small, and can be extended to arbitrary descriptive statistics when we can afford to hold a dictionary of counts over distinct outcomes (e.g., 5 rating values) within each group. For more restrictive settings where we have a reasonable number of groups (e.g., users) but a potentially large number of distinct outcomes within each group (e.g., each movie’s popularity), we may lose the ability to store within-group distributions in memory or compute quantiles, but can still compute combinable statistics such as counts and means via incremental updates over each observation. Finally, when we have an extremely large number of groups (e.g., billions of videos on YouTube), even simple within-group statistics (e.g., number of views by video) become intractable for arbitrarily ordered input data.

These scenarios are summarized in the following table, where N is the total number of observations, G is the number of groups, V is the average number of distinct outcomes within groups, and the data are assumed to be in arbitrary order.

Memory Scenario Distributions Statistics Example
N Small dataset Yes General Arbitrary stats on MovieLens
V*G Small distributions Yes General Median rating by movie for Netflix
G Small # groups No Combinable Mean rating by movie for Netflix
V Small # outcomes No No Mean movie popularity by user for Netflix
1 Large # both No No Mean video popularity for YouTube

As we’ll see in the next lecture, the set of computable statistics improves substantially for pre-grouped input data, where all observations for a given group appear sequentially. As a result we’ll shift our attention to MapReduce, a system designed to efficiently solve the group-by problem for large-scale data.


  • The course GitHub page has code to download the MovieLens dataset and R scripts for the user eccentricity analysis, as well as code for simple streaming analysis of the Netflix data.
  • Here are a few R tutorials and shell tutorials if you’re new to these tools.
  • Linux and Mac OS X users should have pre-installed GNU tools including sed, awk, grep, cut, and uniq. Windows users can download Cygwin or MinGW to install these tools.
  • See these notes on streaming variance algorithms for an glimpse at some more interesting combinable statistics.

Lecture 1: Case Studies in Computational Social Science

This week’s lecture provided a high-level introduction to computational social science, an emerging field at the intersection of social science, statistics, and computer science that aims to use large-scale, individual-level data on who people are (demographics), what they do (behavioral & temporal data), and who they know (networks) to further our understanding of human behavior. In contrast to traditional approaches to social science (e.g., surveys, aggregate data analysis, and lab experiments), the large-scale nature of these data present unique computational and methodological challenges that will be discussed throughout the course.

We discussed a basic research loop, with three broad but important steps:

  1. Formulate the question.
  2. Find/collect data.
  3. Address methodological challenges.

While it may be tempting to think of this as a simple, linear process, in practice we often find ourselves iterating through the loop several times to address various complications that arise. For instance, we may need to refine the motivating question given methodological challenges or limitations of the available data, or develop novel methods to deal with computational issues.

We then discussed several different questions from a variety of domains:

  • Marketing: The long tail of consumption. What is the impact of inventory size on customer satisfication? How does interest in niche content vary across individuals?

  • Political science: The convention bounce. Do people actually switch which candidate they support? Are there overall population-level shifts in candidate support?

  • Demography: The digital divide. Does Internet access/usage affect health, education, and employment outcomes? How do Internet usage patterns vary across subpopulations?

  • Economics: Auction design. How do you optimally set auction parameters (e.g., reserve prices, “buy it now”, etc.)?

  • Communication theory: Information diffusion. How do ideas and products spread through society? What is the empirical structure of diffusion cascades?

The technical details of investigating these questions will be the subject of subsequent lectures, but curious readers can find more information in the following papers.