Author Archives: Sergei Vassilvitskii

Lecture 10: Experimental Design

This lecture focused on the difficulty of experimental design. At first glance, experimental design is easy: separate subjects into a treatment and a control group, administer the treatment and measure the effect. However, the devil lies in the details and small changes in the methodology may lead to subtle errors and ultimately incorrect conclusions.

The first point to address is the splitting criteria. A famously incorrect way to measure advertising effectiveness is to look at correlations on overall revenue with advertising spend. However, these two are inherently correlated, since one tends to spend more on advertising precisely when revenue is expected to be higher (for example: a ski shop advertising in late fall, or black friday sales the day after thanksgiving). This fallacy is especially prevalent in online experiments, where the people who are more active online are different (from a demographic perspective) than those who are less active. Just imagine whether you use the Internet the same way as your parents or your grandparents. The solution is to randomize across the population so that every subject has the same chance of being in each group.

Often explicit experiments are hard to perform, and a natural approach is to work with observational data. Here one has to worry about the subtle problem exemplified by the Simpsons paradox. If the control/treatment decision is conditioned on a latent variable, which is unknown to the experimenter, the experiment may lead to incorrect results. A famous example is the Berkeley gender discrimination lawsuit. The numbers showed the admission rate for men to the university was significantly higher than that for women. Further explanation showed that most departments actually had a slight bias towards women in their admissions, and the overall data was explained by the fact that women tended to apply to more competitive departments.

Controlled experiments don’t suffer from Simpson’s paradox, and have many other advantages in the online setting. Online experiments can reach literally millions of people, and thus can be used to measure very small effects (Lewis et al. 2011). They can be relatively cheap to run with platforms like Amazon’s mechanical turk (Mason and Suri 2013). And can be used to recruit diverse subjects, rather than the typical “undergraduates at a large midwestern university, ” which can lead to drastically different conclusions (Henrick et al. 2010}). The only major downside comes from the fact that people may behave differently online as they do offline.

WHere controlled experiments may seem contrived, and observational data leads to inconclusive results, natural experiments can help. In natural experiments some minor aspect of the system causes different treatments to be presented to different people, in a way that the subjects cannot control. We talked about three such experiments in class: measuring the effect of ad wear out (Lewis et al. 2011), the effect of yelp ratings on restaurant revenue (Luca 2011) and the effect that gamification and badges have on user behavior in online communities (Oktay et al. 2010).

Overall, controlled experiments, observational studies and natural experiments are complementary approaches to studying human behavior.

Mason and Suri, “Conducting Behavioral Research on Amazon’s Mechanical Turk”

Lewis et al., “Here, There, and Everywhere: Correlated Online Behaviors Can Lead to Overestimates of the Effects of Advertising”

Henrich et al., The WEIRDest people in the world?”

Mike Luca, “Reviews, Reputation, and Revenue: The Case of

Oktay et al., “Causal Discovery in Social Media Using Quasi-Experimental Designs”

Lectures 7 & 8: Complexity of Counting

So far in the course we learned that careful counting is behind many social science results. But how hard is it to count? For example, suppose we have two (unsorted) lists of numbers, each with 100 million unique items. How long will it take to find the size of the intersection of these two sets? The most straightforward idea is to check every element in set one against every element in set two. This is simple and correct, however the number of comparisons required grows as the product of the size of two sets. Thus for two lists of 100M numbers each, even if we could do 1B comparisons per second, the calculation will take about 10 million seconds, or more than three months!

The reason this approach is infeasible is that the underlying algorithm is quadratic—if the size of each of the sets were to triple, we would now need to perform nine (three squared) times as many comparisons—and for large datasets quadratic algorithms don’t scale. Instead, we often look for linear time algorithms. For such approaches if the size of the input were to triple, the running time would increase by a factor of three as well. In the worst case, we consider algorithms with running time slightly above linear, for example just sorting a list of 100M items takes about a billion comparisons. More generally for an input of size n, sorting the input takes approximately n log n comparisons. Although this is more than linear—tripling the input increases the running time by more than a factor of three, it is only slightly more so.

Coming back to the intersection problem, there are many possible algorithms for finding the intersection of the two lists. To beat the quadratic time barrier it helps to preprocess the input before doing the comparisons. Without preprocessing, the only way to check whether an item from the first list appears in the second list is to scan and compare it to all of the elements, which leads to the quadratic time algorithm. Suppose instead, that such a membership test (testing whether a specific item is in the list) were faster, taking only constant time, then the overall algorithm would run in linear time! A data structure known as a hashtable does exactly this—it’s a way of preprocessing the input to answer only membership queries, but to answer them very quickly, on average in constant time.

In the second lecture we saw that in addition to the overall running time, an important consideration is the skew in the data. We already saw how the skew makes some statistics (for example the average) somewhat misleading when describing a dataset. For example, in real world networks, some nodes are going to have much higher degrees (e.g. friends or followers) than others. Taking this skew into account is crucial, especially when parallelizing the computation across many machines, as naive parallelizations may result in many machines finishing very quickly, in a matter of seconds, whereas the last few take hours to run.

We saw this demonstrated with triangle counting algorithms. The most straightforward approaches run in time quadratic in the degree of each node, which is fine for the majority of the nodes, but causes problems for the very high degree nodes (quadratic algorithms can’t handle very large instances). There are many ways to rectify this, for example treating the very high degree nodes differently by splitting them smaller subsets, and carefully adjusting the computation. An alternative approach is to assign each triangle to be counted from the perspective of the lowest degree node. Although this approach is also quadratic, it is only quadratic in the number of neighbors who have even higher degree, which makes for much more balanced computation.

Lecture 7:

Lecture 8:

Lecture 3: Counting at Scale, Part I

Last week we saw that many social science questions can be answered by simply counting relevant quantities in the data. For example looking at the median rank of movies viewed by an individual allowed us to compute the persons’s eccentricity, a value that we then used to gauge the impact of larger catalogs on people’s happiness.

However, as datasets grow in size, even simple counting becomes a time consuming task on a single computer. Moreover, CPUs have largely stopped getting faster; instead, all modern processors exhibit multiple cores, which can do work simultaneously, and the parallelism doesn’t stop there—modern data centers have hundreds of machines each with multiple CPUs, each with multiple cores. An obvious question arises: how can we distribute counting tasks across machines and cores to take advantage of this massive computational power?

A key principle in distribution is to split up the computation so as to minimize communication between different machines. For example, suppose we have a dataset where every line represents a phone call, showing the caller, time of call, duration of call and the number dialed, and we want to identify the person who is receiving the most number of calls. If we partition the data by the callee, then we can guarantee that all calls to a particular person end up on the same machine, making it easy for each machine to calculate the most popular callee from those assigned to it. If, on the other hand, we partition by the caller, we would still need to aggregate the data across all of the machines to find out how many times an individual was dialed.

This simple example highlights that, unfortunately, there is no perfect split of the data—for example, if we wanted to instead find the person who called the most people, we would partition by the caller, not the callee; if we wanted to find the person who spent the most time on the phone, we would aggregate all phone calls to and from the same person on the same machine. A key realization here is that while the specific aggregation function differs in each case, all of these problems can be handled by one underlying infrastructure. MapReduce, first introduced by Dean and Ghemawat, is one such infrastructure that decomposes such tasks into two simple functions: “map”, which specifies how the data are to be partitioned, and “reduce”, which governs what happens on every partition.

Specifically, the MapReduce system (and its open source implementation, Hadoop), treats all data as (key, value) pairs, which the programmer writes map and reduce functions to control. In the above example, the key may be the timestamp, and the value encapsulates the caller, callee and duration of call. In the map step we define how we want the data partitioned by producing a key for each row. The MapReduce system then performs a distributed group-by, guaranteeing that all elements with the same key end up on the same machine. Thus if we want to aggregate by the caller, we set the key to be the caller_id; if we would rather aggregate by the callee, we set the key to be the id of the recipient.

In the reduce step, the programmer specifies what to do with the list of values associated with each key. If we are looking for the most popular callee, we count the number of unique people who called this person. If we are interested in the person with the largest phone bill, we count the total duration of all phone calls made by the same caller, etc.

This simple divide-and-conquer abstraction—telling the system first how to partition the data (map) and then what to do on each partition (reduce)—is immensely powerful. It easily scales to efficiently utilize thousands of machines, and allows us to efficiently compute on multi-terabyte sized inputs. We will explore the full power of this paradigm in the coming weeks.


  • Python scripts for local MapReduce and simple wordcount examples are on the course GitHub page.