# Lectures 11 & 12: Regression

These lectures explored both the practice of and theory behind applied regression analysis. As stated by Cook & Weisberg, regression analysis aims “to understand, as far as possible with the available data, how the conditional distribution of the response varies across subpopulations determined by the possible values of the predictor(s).” This includes everything from simple descriptive statistics discussed in our earlier lectures to more sophisticated statistical models for situations in which we have a small number of observations relative to the number of subpopulations. As Berk explains in his constructive critique, the aim of such analyses may vary from providing a compact description of the data, to predicting future outcomes, to explaining associations between outcomes and predictors.

We discussed two case studies: “Predicting consumer behavior with Web search” and “A Large-Scale Study of Browsing Behavior.” The first explored the relative value of various data sources for predicting the performance of movies, music, and video games, using simple auto-regressive models to generate future sales predictions from current and past performance. The second study investigated variation in how different demographic groups use the Web, making use of linear models to smooth descriptive statistics over subpopulations and control for covariation between predictors (e.g., education and race).

In the following lecture we reviewed the formalism behind ordinary least squares regression, phrased as maximum likelihood inference under the assumption of additive Gaussian noise. We then derived the closed-form solution to the normal equations for small-scale problems and discussed alternative optimization methods, namely gradient descent and stochastic gradient descent, for larger-scale settings.

Ordinary least squares regression is, in principle, easily solved by inverting the normal equations:
$$\hat{w} = (X^T X)^{-1} X^T y.$$
In practice, however, it is often computationally expensive to perform the necessary matrix inversion for models with many features, even with specialized numerical methods.

Gradient descent offers an alternative solution to the normal equations, replacing potentially expensive matrix inversion with an iterative method where we update parameters by moving in the direction of steepest increase of the likelihood landscape:
$$\hat{w} \leftarrow \hat{w} + \eta X^T (y - X\hat{w}),$$
where eta is a tunable step size. Choosing eta too small leads to slow convergence, whereas too large a step size may result in undesirable oscillations about the optimum. Intuitively, gradient descent updates each component of the weight vector by a sum of the corresponding feature values over all examples, where examples are weighted by the error between actual and predicted labels. Stochastic gradient descent provides an even more scalable alternative where an approximation to the gradient from a subsample of the data is used in place of the entire dataset.
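As a rough illustration of the update above, here is a minimal pure-Python sketch of batch gradient descent for least squares. The toy data, the step size `eta=0.01`, the iteration count, and the function names are all assumptions chosen for the example, not values from the lecture.

```python
def predict(w, x):
    """Linear prediction w . x for a single example."""
    return sum(wj * xj for wj, xj in zip(w, x))


def gradient_step(w, X, y, eta):
    """One batch update: w <- w + eta * X^T (y - X w)."""
    errors = [yi - predict(w, xi) for xi, yi in zip(X, y)]
    return [
        wj + eta * sum(err * xi[j] for xi, err in zip(X, errors))
        for j, wj in enumerate(w)
    ]


def fit_ols_gd(X, y, eta=0.01, n_iters=5000):
    """Run a fixed number of batch gradient steps starting from w = 0."""
    w = [0.0] * len(X[0])
    for _ in range(n_iters):
        w = gradient_step(w, X, y, eta)
    return w


# Toy data generated from y = 1 + 2x with no noise.
# The first column is a constant 1 so that w[0] acts as the intercept.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
w_hat = fit_ols_gd(X, y)
print(w_hat)  # should be close to [1.0, 2.0]
```

Each step touches every example once, so the cost per iteration is linear in the size of the data. A stochastic variant would compute `errors` on a random subsample instead of the full dataset.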

See Chapter 3 of Bishop and Chapter 3 of Hastie for reference.

# Lecture 13: Classification

In this lecture we discussed classification methods for predicting discrete-valued targets (e.g., spam classification or gender identification). We noted several potential issues in directly applying linear regression to classification problems and explored naive Bayes and logistic regression as alternatives.

We first reviewed Bayes’ rule for inverting conditional probabilities via a simple, but perhaps counterintuitive, medical diagnosis example and then adapted this to an (extremely naive) one-feature classifier. We improved upon this by considering naive Bayes—a simple linear method for classification in which we model each feature independently. While the independence assumption is almost definitely incorrect, naive Bayes turns out to work well in practice. In addition, naive Bayes is simple to train and predict with at scale. Unfortunately, however, it does fail to account for correlations amongst features.
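To see why the diagnosis example feels counterintuitive, here is a small sketch of Bayes’ rule with made-up numbers. The 1% prevalence, 99% sensitivity, and 5% false positive rate are assumptions for illustration only, not the figures used in class.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(disease | positive test) computed with Bayes' rule."""
    true_pos = sensitivity * prior                     # P(+ | disease) P(disease)
    false_pos = false_positive_rate * (1.0 - prior)    # P(+ | healthy) P(healthy)
    return true_pos / (true_pos + false_pos)


# Hypothetical test: rare condition, accurate test, modest false positive rate.
p = posterior(prior=0.01, sensitivity=0.99, false_positive_rate=0.05)
print(round(p, 3))  # 0.167
```

Even with a seemingly accurate test, most positive results here come from the much larger healthy population, so the posterior probability of disease is only about one in six.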

Logistic regression addresses this issue by modeling the conditional probability of the class given the features directly, using a logistic function to transform predictions from a linear model to lie in the unit interval:
$$p(y=1|x, w) = {1 \over 1 + e^{-w \cdot x}}.$$
While maximum likelihood inference for logistic regression does not permit a closed-form solution, gradient descent results in the following update equations, similar to linear regression:
$$\hat{w} \leftarrow \hat{w} + \eta X^T (y - p),$$
where p is the vector of predicted probabilities for the training examples. In smaller-scale settings one can improve on these updates by using second-order methods such as Newton-Raphson that leverage the local curvature of the likelihood landscape to determine the step size at each iteration. As with regression, some form of regularization is often useful for balancing the fit to training data with generalization error when one has a relatively large number of features.
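The gradient update for logistic regression can be sketched in a few lines of pure Python. The toy data, step size, iteration count, and function names below are assumptions for illustration; there is no regularization, so this is only sensible for data that are not perfectly separable.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real number to the unit interval."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, eta=0.05, n_iters=5000):
    """Batch gradient updates: w <- w + eta * X^T (y - p)."""
    w = [0.0] * len(X[0])
    for _ in range(n_iters):
        # Predicted probability for every training example.
        p = [sigmoid(sum(wj * xj for wj, xj in zip(w, xi))) for xi in X]
        w = [
            wj + eta * sum((yi - pi) * xi[j] for xi, yi, pi in zip(X, y, p))
            for j, wj in enumerate(w)
        ]
    return w


# Toy one-feature problem with overlapping classes (not linearly separable).
# The first column is a constant 1 for the intercept.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0], [1.0, 5.0]]
y = [0, 0, 1, 0, 1, 1]
w_hat = fit_logistic(X, y)
p_low = sigmoid(w_hat[0])                     # predicted probability at x = 0
p_high = sigmoid(w_hat[0] + 5.0 * w_hat[1])   # predicted probability at x = 5
```

The only difference from the least squares update is that the linear prediction is passed through the logistic function before computing the error.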

References include Chapter 4 of Bishop, Chapter 4 of Hastie, Chapter 6 of Segaran, Horvitz et al., 1998, Lewis, 1998, Graham, 2002, and Metsis et al., 2006.

# Lecture 10: Experimental Design

This lecture focused on the difficulty of experimental design. At first glance, experimental design is easy: separate subjects into a treatment and a control group, administer the treatment and measure the effect. However, the devil lies in the details and small changes in the methodology may lead to subtle errors and ultimately incorrect conclusions.

The first point to address is the splitting criterion. A famously incorrect way to measure advertising effectiveness is to look at the correlation between overall revenue and advertising spend. However, these two are inherently correlated, since one tends to spend more on advertising precisely when revenue is expected to be higher (for example: a ski shop advertising in late fall, or Black Friday sales the day after Thanksgiving). This fallacy is especially prevalent in online experiments, where the people who are more active online are different (from a demographic perspective) from those who are less active. Just imagine whether you use the Internet the same way as your parents or your grandparents. The solution is to randomize across the population so that every subject has the same chance of being in each group.

Often explicit experiments are hard to perform, and a natural approach is to work with observational data. Here one has to worry about the subtle problem exemplified by Simpson’s paradox. If the control/treatment decision is conditioned on a latent variable that is unknown to the experimenter, the analysis may lead to incorrect results. A famous example is the Berkeley gender discrimination lawsuit. The numbers showed the admission rate for men to the university was significantly higher than that for women. Further examination showed that most departments actually had a slight bias towards women in their admissions, and the overall data was explained by the fact that women tended to apply to more competitive departments.

Controlled experiments don’t suffer from Simpson’s paradox, and have many other advantages in the online setting. Online experiments can reach literally millions of people, and thus can be used to measure very small effects (Lewis et al. 2011). They can be relatively cheap to run with platforms like Amazon’s Mechanical Turk (Mason and Suri 2013). They can also be used to recruit diverse subjects, rather than the typical “undergraduates at a large midwestern university,” a population that can lead to drastically different conclusions (Henrich et al. 2010). The only major downside comes from the fact that people may behave differently online than they do offline.

Where controlled experiments may seem contrived, and observational data leads to inconclusive results, natural experiments can help. In natural experiments some minor aspect of the system causes different treatments to be presented to different people, in a way that the subjects cannot control. We talked about three such experiments in class: measuring the effect of ad wear out (Lewis et al. 2011), the effect of Yelp ratings on restaurant revenue (Luca 2011) and the effect that gamification and badges have on user behavior in online communities (Oktay et al. 2010).

Overall, controlled experiments, observational studies and natural experiments are complementary approaches to studying human behavior.

Mason and Suri, “Conducting Behavioral Research on Amazon’s Mechanical Turk” http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1691163

Lewis et al., “Here, There, and Everywhere: Correlated Online Behaviors Can Lead to Overestimates of the Effects of Advertising” http://www2011india.com/proceeding/proceedings/p157.pdf

Henrich et al., “The WEIRDest people in the world?” http://www.econstor.eu/bitstream/10419/43616/1/626014360.pdf

Mike Luca, “Reviews, Reputation, and Revenue: The Case of Yelp.com” http://ctct.wpengine.com/wp-content/uploads/2011/10/12-016.pdf

Oktay et al., “Causal Discovery in Social Media Using Quasi-Experimental Designs” http://people.cs.umass.edu/~hoktay/pub/soma2010.pdf

# Lecture 9: Data Wrangling

This is a guest lecture and post by John Myles White.

In the second half of the lecture, we discussed how to work with the data that one acquires from websites using any of the three methods above. This data is often structured in formats like JSON and XML that must be parsed by the user using formal parsing libraries available in many popular languages like Python. Sometimes the data is in an unstructured format in which we simply want to extract basic information like phone numbers: we described the use of regular expressions as a mechanism for extracting this information. We worked through an extended example of building a regular expression that would match phone numbers.
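As a sketch of the kind of regular expression discussed, the snippet below matches a few common US-style phone number formats using Python’s `re` module. The exact pattern built in lecture is not reproduced here; the formats covered and the sample text are assumptions.

```python
import re

# Verbose-mode pattern: whitespace and comments inside it are ignored.
PHONE_RE = re.compile(
    r"""
    (?<!\d)                         # not preceded by another digit
    (?:\(\d{3}\)\s*|\d{3}[-.\s])    # area code, with or without parentheses
    \d{3}[-.\s]                     # three-digit exchange and a separator
    \d{4}                           # four-digit line number
    (?!\d)                          # not followed by another digit
    """,
    re.VERBOSE,
)

text = "Call (212) 555-0100 or 646.555.0199; order 1234567890123 is not a phone number."
matches = PHONE_RE.findall(text)
print(matches)  # ['(212) 555-0100', '646.555.0199']
```

The lookarounds at both ends keep the pattern from matching inside longer runs of digits such as order numbers. Real-world data usually needs more cases (country codes, extensions) than this sketch handles.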

# Lectures 7 & 8: Complexity of Counting

So far in the course we have learned that careful counting is behind many social science results. But how hard is it to count? For example, suppose we have two (unsorted) lists of numbers, each with 100 million unique items. How long will it take to find the size of the intersection of these two sets? The most straightforward idea is to check every element in set one against every element in set two. This is simple and correct; however, the number of comparisons required grows as the product of the sizes of the two sets. Thus for two lists of 100M numbers each, even if we could do 1B comparisons per second, the calculation would take about 10 million seconds, or more than three months!

The reason this approach is infeasible is that the underlying algorithm is quadratic: if the size of each of the sets were to triple, we would need to perform nine (three squared) times as many comparisons, and for large datasets quadratic algorithms don’t scale. Instead, we often look for linear-time algorithms, for which tripling the size of the input triples the running time as well. When no linear-time algorithm is available, we settle for algorithms with running time slightly above linear. For example, just sorting a list of 100M items takes a few billion comparisons; more generally, for an input of size n, sorting takes approximately n log n comparisons. Although this is more than linear (tripling the input increases the running time by more than a factor of three), it is only slightly more so.

Coming back to the intersection problem, there are many possible algorithms for finding the intersection of the two lists. To beat the quadratic time barrier it helps to preprocess the input before doing the comparisons. Without preprocessing, the only way to check whether an item from the first list appears in the second list is to scan and compare it to all of the elements, which leads to the quadratic time algorithm. Suppose instead that such a membership test (testing whether a specific item is in the list) were faster, taking only constant time. Then the overall algorithm would run in linear time! A data structure known as a hash table does exactly this: it is a way of preprocessing the input to answer only membership queries, but to answer them very quickly, on average in constant time.
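The two approaches can be compared side by side in Python, where the built-in `set` is backed by a hash table. This is a minimal sketch that assumes, as in the example above, that each list contains unique items; the function names and toy inputs are made up.

```python
def intersection_size_quadratic(a, b):
    """Compare every item of a against every item of b: O(len(a) * len(b))."""
    count = 0
    for x in a:
        for y in b:
            if x == y:
                count += 1
                break  # items are unique, so stop at the first match
    return count


def intersection_size_hashed(a, b):
    """Preprocess b into a hash table, then do constant-time membership tests."""
    seen = set(b)  # linear-time preprocessing
    return sum(1 for x in a if x in seen)


# Small toy inputs: multiples of 2 and multiples of 3 below 1000.
a = list(range(0, 1000, 2))
b = list(range(0, 1000, 3))
print(intersection_size_quadratic(a, b), intersection_size_hashed(a, b))  # 167 167
```

Both functions return the same answer, but only the second remains practical as the lists grow to hundreds of millions of items, provided the hash table fits in memory.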

In the second lecture we saw that in addition to the overall running time, an important consideration is the skew in the data. We already saw how the skew makes some statistics (for example the average) somewhat misleading when describing a dataset. For example, in real world networks, some nodes are going to have much higher degrees (e.g. friends or followers) than others. Taking this skew into account is crucial, especially when parallelizing the computation across many machines, as naive parallelizations may result in many machines finishing very quickly, in a matter of seconds, whereas the last few take hours to run.

We saw this demonstrated with triangle counting algorithms. The most straightforward approaches run in time quadratic in the degree of each node, which is fine for the majority of the nodes, but causes problems for the very high degree nodes (quadratic algorithms can’t handle very large instances). There are many ways to rectify this, for example treating the very high degree nodes differently by splitting them into smaller subsets, and carefully adjusting the computation. An alternative approach is to assign each triangle to be counted from the perspective of the lowest degree node. Although this approach is also quadratic, it is only quadratic in the number of neighbors who have even higher degree, which makes for much more balanced computation.
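The lowest-degree-node idea can be sketched on a single machine as follows. This is an illustrative version, not the distributed algorithm from lecture: it assumes an undirected edge list with no self-loops or duplicate edges, and breaks degree ties by node id (an assumption that requires node ids to be comparable).

```python
from itertools import combinations


def count_triangles(edges):
    """Count each triangle once, from its lowest-degree node."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    def rank(node):
        # Order nodes by degree, breaking ties by node id.
        return (len(neighbors[node]), node)

    total = 0
    for node, nbrs in neighbors.items():
        # Only look at neighbors that rank higher than this node.
        higher = [n for n in nbrs if rank(n) > rank(node)]
        # Quadratic only in the number of higher-ranked neighbors.
        for a, b in combinations(higher, 2):
            if b in neighbors[a]:
                total += 1
    return total


# Toy graph: a 4-clique on nodes 1-4 plus a pendant node 5.
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (2, 4), (1, 4), (4, 5)]
print(count_triangles(edges))  # 4
```

A very high degree node has few neighbors that rank above it, so the pair enumeration stays small exactly where the naive approach would be slowest.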


# Lectures 5 & 6: Networks

These two lectures introduced network theory, following the first part of “Networks, Crowds, and Markets” by Easley & Kleinberg. We discussed basic representations of relational data, basic algorithms for computations on graphs, and common features of real-world networks.

Slides are below. See “Predicting Individual Behavior with Social Networks” and “The Structure of Online Diffusion Networks” for more details of the case studies we covered, and the references that follow for information on using Pig and Amazon’s computing services.

References:

• Hadoop is an open source framework for scalable storage and computing, including a distributed filesystem and scalable implementation of MapReduce.
• Pig is a high-level language that converts sequences of common data analysis operations (e.g., group-bys, joins, etc.) to chains of MapReduce jobs and executes these either locally or across a Hadoop cluster.
• Pig is easy to download and try on Linux, Mac OS X, or Windows (requires Java). See RELEASE_NOTES.txt for information on trying Pig in local mode.
• The data types in Pig are slightly different than in most languages, including bags, tuples, and fields.
• The diagnostic operators in Pig are useful for understanding code and debugging. In particular, the DUMP operator shows contents of relations and the ILLUSTRATE operator shows how relations are transformed through a Pig program.
• Programming Pig is a freely available book with an accompanying GitHub page of examples.
• In addition to running Hadoop and Pig locally, Amazon’s Elastic Compute Cloud (EC2) and Simple Storage Service (S3) provide pay-per-use computation and storage that can be used to construct rentable clusters for large-scale data analysis.
• Elastic MapReduce (EMR) facilitates this process, allowing one to easily submit MapReduce jobs without needing to configure low-level cluster details.
• Amazon’s getting started guide walks through the details of creating an account and launching/logging into a simple Linux instance. Be sure to create a keypair for this last step with the AWS console. You can use the ssh command to log in to machines on Linux, Mac OS X, or in Cygwin on Windows; alternatively you can use the AWS console or Putty to access remote machines.
• The process for running a Hadoop/Pig job is similar, using the EMR Console. You can alternatively select to run an interactive session and log into the cluster as opposed to submitting just a single job. More details on interactive sessions and individual job flows are available through Amazon’s tutorial articles.
• Amazon also provides a number of public data sets for analysis on EMR, including a copy of the Common Crawl Corpus of several billion web pages.

# Lecture 4: Counting at Scale, Part II

The simple divide and conquer abstraction of MapReduce that we saw in the past lecture is immensely powerful for large scale data analysis. Many of the common statistics that we are interested in computing fall into this formulation.

For example, to find the maximum of a set of numbers, we can partition the dataset into multiple groups, find the maximum of each group and return the maximum of the group maxima. To find the average, we find the total sum associated with each group, as well as the number of elements in each, then combine these in an obvious manner. Not all functions can be computed this way—for example computing the median is much harder, as we cannot simply look at the median of each group. In fact, there is no deterministic algorithm that finds the median in one pass through the data without keeping all of the data in memory. Rather this requires a multi-pass approach, first finding the approximate range where the median lies, and then iterating through the data again to find the exact median.
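The difference between combinable statistics and the median can be seen in a small sketch. The partitions and function names below are made up for illustration.

```python
from statistics import median


def partition_summary(values):
    """Per-partition summary: (max, sum, count)."""
    return max(values), sum(values), len(values)


def combine(summaries):
    """Combine per-partition summaries into the overall max and mean."""
    maxima, sums, counts = zip(*summaries)
    return max(maxima), sum(sums) / sum(counts)


partitions = [[3, 9, 4], [12, 1], [7, 7, 7, 2]]
overall_max, overall_mean = combine(partition_summary(p) for p in partitions)

# The same trick does not work for the median.
median_of_medians = median(median(p) for p in partitions)
true_median = median(x for p in partitions for x in p)
print(overall_max, overall_mean)        # max is 12, mean is 52/9
print(median_of_medians, true_median)   # 6.5 7
```

Note that the mean is computed from sums and counts rather than by averaging the per-partition means, which would give the wrong answer whenever partitions differ in size.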

While computing summary statistics is relatively easy once the data is in the right form, getting to this point usually requires a fair amount of work. For instance, we often have multiple input sources, each describing different aspects of the data—we may have one file describing the ratings that each person gave to each movie, a separate file with movie details (e.g., year of release, director, etc.), and yet another file containing user information (e.g., demographics, reported location, etc.). Thus if we want to find the movie from 1985 most watched by people in New York, we have to combine all of these datasets before issuing a simple maximum query.

This kind of combination is formally called a join. There are many possible kinds of joins—left, right, inner, outer, etc.—each of which performs the combination in a particular manner. For example, given a dataset mapping each user rating to a movie, and a movie to its year of release, an inner join on the movie name will produce a three column table—with a rating, movie name, and release year, but only for those movies that have both a rating and a release year; any incomplete rows in the resulting table will be dropped. On the other hand, a left join will keep all entries of the first table (here the movie ratings table) even for movies that have no release year, but will not produce movies with a release year but no ratings. Conversely, a right join will keep all entries of the second table, even for movies that have no ratings; finally, a full outer join will keep all entries and will fill in missing data whenever possible.
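The four kinds of joins described above can be sketched with plain Python data structures. The movie names, ratings, and years are placeholders, and the functions are hypothetical helpers, not part of any library; missing values are represented by `None`.

```python
# Toy inputs: movie "C" has a rating but no release year,
# and movie "D" has a release year but no ratings.
ratings = [("A", 5), ("A", 4), ("B", 3), ("C", 2)]   # (movie, rating)
release_year = {"A": 1985, "B": 1985, "D": 1990}      # movie -> year


def inner_join(ratings, release_year):
    """Keep only rows with both a rating and a release year."""
    return [(r, m, release_year[m]) for m, r in ratings if m in release_year]


def left_join(ratings, release_year):
    """Keep every rating; the year is None when it is unknown."""
    return [(r, m, release_year.get(m)) for m, r in ratings]


def right_join(ratings, release_year):
    """Keep every movie with a release year; the rating is None if unrated."""
    by_movie = {}
    for m, r in ratings:
        by_movie.setdefault(m, []).append(r)
    return [
        (r, m, year)
        for m, year in release_year.items()
        for r in by_movie.get(m, [None])
    ]


def full_outer_join(ratings, release_year):
    """Keep everything from both sides, filling gaps with None."""
    rated = {m for m, _ in ratings}
    unrated = [(None, m, y) for m, y in release_year.items() if m not in rated]
    return left_join(ratings, release_year) + unrated


print(len(inner_join(ratings, release_year)))       # 3
print(len(full_outer_join(ratings, release_year)))  # 5
```

In practice one would use a database or a tool such as Pig for this; the sketch is only meant to make the row-keeping rules of each join concrete.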

Typical analyses require multiple such joins followed by the relevant calculations on the resulting collections. While one can write a standalone map reduce program for each one of the steps, the logistics of managing the flow of these jobs and all of the intermediate data quickly becomes non-trivial. As we will discuss in coming lectures, Pig—a data processing language that is built on top of MapReduce—facilitates and simplifies the execution of these and other common MapReduce operations, including joins, filtering, sorting, and grouping.

# Lecture 3: Counting at Scale, Part I

Last week we saw that many social science questions can be answered by simply counting relevant quantities in the data. For example, looking at the median rank of movies viewed by an individual allowed us to compute the person’s eccentricity, a value that we then used to gauge the impact of larger catalogs on people’s happiness.

However, as datasets grow in size, even simple counting becomes a time-consuming task on a single computer. Moreover, CPUs have largely stopped getting faster; instead, modern processors have multiple cores that can do work simultaneously. The parallelism doesn’t stop there: modern data centers have hundreds of machines, each with multiple CPUs, each with multiple cores. An obvious question arises: how can we distribute counting tasks across machines and cores to take advantage of this massive computational power?

A key principle in distribution is to split up the computation so as to minimize communication between different machines. For example, suppose we have a dataset where every line represents a phone call, showing the caller, time of call, duration of call and the number dialed, and we want to identify the person who is receiving the most calls. If we partition the data by the callee, then we can guarantee that all calls to a particular person end up on the same machine, making it easy for each machine to calculate the most popular callee from those assigned to it. If, on the other hand, we partition by the caller, we would still need to aggregate the data across all of the machines to find out how many times an individual was dialed.

This simple example highlights that, unfortunately, there is no perfect split of the data—for example, if we wanted to instead find the person who called the most people, we would partition by the caller, not the callee; if we wanted to find the person who spent the most time on the phone, we would aggregate all phone calls to and from the same person on the same machine. A key realization here is that while the specific aggregation function differs in each case, all of these problems can be handled by one underlying infrastructure. MapReduce, first introduced by Dean and Ghemawat, is one such infrastructure that decomposes such tasks into two simple functions: “map”, which specifies how the data are to be partitioned, and “reduce”, which governs what happens on every partition.

Specifically, the MapReduce system (and its open source implementation, Hadoop) treats all data as (key, value) pairs, which the programmer controls by writing map and reduce functions. In the above example, the key may be the timestamp, and the value encapsulates the caller, callee and duration of call. In the map step we define how we want the data partitioned by producing a key for each row. The MapReduce system then performs a distributed group-by, guaranteeing that all elements with the same key end up on the same machine. Thus if we want to aggregate by the caller, we set the key to be the caller_id; if we would rather aggregate by the callee, we set the key to be the id of the recipient.

In the reduce step, the programmer specifies what to do with the list of values associated with each key. If we are looking for the most popular callee, we count the number of unique people who called this person. If we are interested in the person with the largest phone bill, we count the total duration of all phone calls made by the same caller, etc.
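The map, group-by, and reduce steps can be mimicked on a single machine in a few lines of Python. This is a toy stand-in for the real system, not Hadoop’s API and not the course scripts: the `map_reduce` helper, the record layout, and the call data are all assumptions for illustration.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Local sketch: map each record to (key, value) pairs, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)      # stands in for the distributed group-by
    return {key: reducer(key, values) for key, values in groups.items()}


# Hypothetical call records: (caller, callee, duration in seconds).
calls = [
    ("alice", "bob", 600),
    ("carol", "bob", 30),
    ("alice", "bob", 45),
    ("bob", "carol", 120),
    ("dave", "carol", 10),
    ("erin", "bob", 5),
]


def map_by_callee(call):
    caller, callee, _duration = call
    yield callee, caller                   # key on the recipient


def count_unique_callers(_callee, callers):
    return len(set(callers))


def map_by_caller(call):
    caller, _callee, duration = call
    yield caller, duration                 # key on the caller


def total_duration(_caller, durations):
    return sum(durations)


popularity = map_reduce(calls, map_by_callee, count_unique_callers)
talk_time = map_reduce(calls, map_by_caller, total_duration)
most_popular = max(popularity, key=popularity.get)
print(most_popular, talk_time["alice"])    # bob 645
```

Switching from one question to the other only requires swapping the two small functions; the grouping machinery in `map_reduce` stays the same, which is the point of the abstraction.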

This simple divide-and-conquer abstraction—telling the system first how to partition the data (map) and then what to do on each partition (reduce)—is immensely powerful. It easily scales to efficiently utilize thousands of machines, and allows us to efficiently compute on multi-terabyte sized inputs. We will explore the full power of this paradigm in the coming weeks.

References:

• Python scripts for local MapReduce and simple wordcount examples are on the course GitHub page.

# Lecture 2: An Introduction to Counting

This week marked the first of three lectures on counting. While this may seem like a lot of time to devote to a very simple concept, we’ll see that counting at scale is surprisingly more powerful—and unfortunately somewhat more difficult—than one might think.

Take, for instance, the questions we discussed around the long tail of consumption last week: How does interest in niche content vary across individuals, and what’s the impact of inventory size on customer satisfaction? At their core, these questions can be rephrased as simple counting problems where we first group the data (e.g., by user, by item, etc.) and then compute distributions and statistics of interest within each group.

For example, one approach to quantifying users’ eccentricity is to first rank movies by popularity and then compute the median rank of the movies rated by each individual. The first operation is a group-by-movie followed by a simple count within each group to measure movie popularity, while the second is a group-by-user followed by a median within each group to measure the popularity of a typical movie watched by each user. Mainstream users who tend to watch more popular content will score lower on this eccentricity measure, while more niche individuals will have larger eccentricity scores due to their consumption of tail content.

Such calculations are relatively easy for small datasets (e.g., 10M ratings from MovieLens), as various group-by and counting operations are quite fast once we’ve read all of the data from disk into main memory. The same analysis becomes a bit more challenging for larger datasets (e.g., Netflix) that don’t fit in memory, as random access from disk is about 1000x slower than equivalent lookups in memory. One solution is to appeal to much faster sequential disk access to stream through the data one observation at a time, storing and updating only the statistics of interest in memory.

For instance, to compute the average rating for each movie in the Netflix dataset, we can simply stream through the file and examine one rating at a time, updating a running mean for each movie as it appears. Doing so requires that we hold a small dictionary over all 20k movies in memory rather than the entire dataset of all 100 million ratings, reducing our memory footprint from gigabytes to megabytes.
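A streaming mean of this kind might look like the following sketch. The `movie_id,rating` line format and the sample data are assumptions for illustration and do not reflect the actual layout of the Netflix files.

```python
import io


def streaming_means(lines):
    """Update a running mean per movie while reading one rating at a time."""
    counts, means = {}, {}
    for line in lines:
        movie_id, rating = line.strip().split(",")
        rating = float(rating)
        n = counts.get(movie_id, 0) + 1
        mean = means.get(movie_id, 0.0)
        counts[movie_id] = n
        # Incremental update: new_mean = old_mean + (x - old_mean) / n
        means[movie_id] = mean + (rating - mean) / n
    return means


# A file-like object stands in for a large ratings file on disk.
sample = io.StringIO("m1,5\nm2,3\nm1,4\nm1,3\nm2,4\n")
means = streaming_means(sample)
print(means)  # {'m1': 4.0, 'm2': 3.5}
```

Only the two small dictionaries live in memory, one entry per movie, regardless of how many ratings pass through.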

This same approach scales to billions or trillions of observations so long as the number of distinct groups (e.g., movies) remains small, and can be extended to arbitrary descriptive statistics when we can afford to hold a dictionary of counts over distinct outcomes (e.g., 5 rating values) within each group. For more restrictive settings where we have a reasonable number of groups (e.g., users) but a potentially large number of distinct outcomes within each group (e.g., each movie’s popularity), we may lose the ability to store within-group distributions in memory or compute quantiles, but can still compute combinable statistics such as counts and means via incremental updates over each observation. Finally, when we have an extremely large number of groups (e.g., billions of videos on YouTube), even simple within-group statistics (e.g., number of views by video) become intractable for arbitrarily ordered input data.

These scenarios are summarized in the following table, where N is the total number of observations, G is the number of groups, V is the average number of distinct outcomes within groups, and the data are assumed to be in arbitrary order.

| Memory | Scenario | Distributions | Statistics | Example |
|---|---|---|---|---|
| N | Small dataset | Yes | General | Arbitrary stats on MovieLens |
| V*G | Small distributions | Yes | General | Median rating by movie for Netflix |
| G | Small # groups | No | Combinable | Mean rating by movie for Netflix |
| V | Small # outcomes | No | No | Mean movie popularity by user for Netflix |
| 1 | Large # both | No | No | Mean video popularity for YouTube |

As we’ll see in the next lecture, the set of computable statistics improves substantially for pre-grouped input data, where all observations for a given group appear sequentially. As a result we’ll shift our attention to MapReduce, a system designed to efficiently solve the group-by problem for large-scale data.

References:

• The course GitHub page has code to download the MovieLens dataset and R scripts for the user eccentricity analysis, as well as code for simple streaming analysis of the Netflix data.
• Here are a few R tutorials and shell tutorials if you’re new to these tools.
• Linux and Mac OS X users should have pre-installed GNU tools including sed, awk, grep, cut, and uniq. Windows users can download Cygwin or MinGW to install these tools.
• See these notes on streaming variance algorithms for a glimpse at some more interesting combinable statistics.

# Lecture 1: Case Studies in Computational Social Science

This week’s lecture provided a high-level introduction to computational social science, an emerging field at the intersection of social science, statistics, and computer science that aims to use large-scale, individual-level data on who people are (demographics), what they do (behavioral & temporal data), and who they know (networks) to further our understanding of human behavior. In contrast to traditional approaches to social science (e.g., surveys, aggregate data analysis, and lab experiments), the large-scale nature of these data presents unique computational and methodological challenges that will be discussed throughout the course.

We discussed a basic research loop, with three broad but important steps:

1. Formulate the question.
2. Find/collect data.

While it may be tempting to think of this as a simple, linear process, in practice we often find ourselves iterating through the loop several times to address various complications that arise. For instance, we may need to refine the motivating question given methodological challenges or limitations of the available data, or develop novel methods to deal with computational issues.

We then discussed several different questions from a variety of domains:

• Marketing: The long tail of consumption. What is the impact of inventory size on customer satisfaction? How does interest in niche content vary across individuals?

• Political science: The convention bounce. Do people actually switch which candidate they support? Are there overall population-level shifts in candidate support?

• Demography: The digital divide. Does Internet access/usage affect health, education, and employment outcomes? How do Internet usage patterns vary across subpopulations?

• Economics: Auction design. How do you optimally set auction parameters (e.g., reserve prices, “buy it now”, etc.)?

• Communication theory: Information diffusion. How do ideas and products spread through society? What is the empirical structure of diffusion cascades?

The technical details of investigating these questions will be the subject of subsequent lectures, but curious readers can find more information in the following papers.