Monthly Archives: April 2013

Lecture 10: Experimental Design

This lecture focused on the difficulty of experimental design. At first glance, experimental design is easy: separate subjects into a treatment group and a control group, administer the treatment, and measure the effect. However, the devil is in the details, and small changes in methodology can lead to subtle errors and, ultimately, incorrect conclusions.

The first point to address is the splitting criterion. A famously incorrect way to measure advertising effectiveness is to look at the correlation between overall revenue and advertising spend. These two are inherently correlated, since one tends to spend more on advertising precisely when revenue is expected to be higher (for example, a ski shop advertising in late fall, or Black Friday sales the day after Thanksgiving). This fallacy is especially prevalent in online experiments, where people who are more active online differ demographically from those who are less active. Just consider whether you use the Internet the same way your parents or grandparents do. The solution is to randomize across the population so that every subject has the same chance of being in each group.
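
As a minimal sketch of random assignment, assuming subjects are identified by arbitrary IDs: the function name `assign_groups` and the fixed seed are illustrative choices, not something from the lecture.

```python
import random

def assign_groups(subject_ids, seed=0):
    """Randomly split subjects into treatment and control groups.

    Shuffling and cutting the list in half gives every subject the same
    chance of landing in either group (exactly 1/2 when the count is even).
    """
    rng = random.Random(seed)      # fixed seed only so the split is reproducible
    shuffled = list(subject_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"treatment": shuffled[:half], "control": shuffled[half:]}

# Hypothetical population of 1000 subject IDs.
groups = assign_groups(range(1000))
print(len(groups["treatment"]), len(groups["control"]))
```

Because assignment depends only on the random shuffle, it cannot be correlated with how active a subject is online or with any other trait.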

Often explicit experiments are hard to perform, and a natural approach is to work with observational data. Here one has to worry about the subtle problem exemplified by Simpson’s paradox: if the control/treatment decision is conditioned on a latent variable that is unknown to the experimenter, the analysis may lead to incorrect conclusions. A famous example is the Berkeley gender discrimination case. The numbers showed that the admission rate for men to the university was significantly higher than that for women. Further examination showed that most departments actually had a slight bias towards women in their admissions, and the overall data was explained by the fact that women tended to apply to more competitive departments.
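
The arithmetic behind the paradox is easier to see with numbers. The sketch below uses made-up counts for two hypothetical departments (these are not the actual Berkeley figures), chosen so that women have the higher admission rate in each department but the lower rate overall.

```python
# Hypothetical (applicants, admitted) counts, NOT the actual Berkeley data.
data = {
    "A (less competitive)": {"men": (80, 48), "women": (20, 13)},
    "B (more competitive)": {"men": (20, 4), "women": (80, 20)},
}

def rate(applied, admitted):
    """Admission rate as a fraction."""
    return admitted / applied

def overall_rate(group):
    """Admission rate for a group pooled across all departments."""
    applied = sum(dept[group][0] for dept in data.values())
    admitted = sum(dept[group][1] for dept in data.values())
    return rate(applied, admitted)

# Per-department rates, then the pooled rates.
for name, dept in data.items():
    print(name, {g: round(rate(*dept[g]), 2) for g in ("men", "women")})
print("overall", {g: round(overall_rate(g), 2) for g in ("men", "women")})
```

In this toy data women are admitted at 65% versus 60% in department A and 25% versus 20% in department B, yet the pooled rates are 33% for women and 52% for men, because most women applied to the more competitive department.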

Controlled experiments don’t suffer from Simpson’s paradox, and have many other advantages in the online setting. Online experiments can reach literally millions of people, and thus can be used to measure very small effects (Lewis et al. 2011). They can be relatively cheap to run with platforms like Amazon’s Mechanical Turk (Mason and Suri 2013). They can also be used to recruit diverse subjects, rather than the typical “undergraduates at a large midwestern university,” a population that can lead to drastically different conclusions (Henrich et al. 2010). The only major downside is that people may behave differently online than they do offline.
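
To give a rough sense of why scale matters for small effects, here is a sketch using the standard normal-approximation sample-size formula for comparing two proportions. The formula is a textbook approximation rather than something covered in the lecture, and the conversion rates in the example are invented for illustration.

```python
import math
from statistics import NormalDist

def subjects_per_group(p_control, p_treatment, alpha=0.05, power=0.8):
    """Approximate subjects needed per group to detect a difference
    between two proportions with a two-sided test."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # critical value for the significance level
    z_beta = z.inv_cdf(power)            # critical value for the desired power
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_treatment - p_control
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)

# Hypothetical example: detecting a lift from a 1.00% to a 1.05% conversion rate.
n = subjects_per_group(0.0100, 0.0105)
print(n)
```

Under these assumptions the answer is on the order of several hundred thousand subjects per group, which is out of reach for most lab studies but feasible online.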

Where controlled experiments may seem contrived and observational data leads to inconclusive results, natural experiments can help. In natural experiments, some minor aspect of the system causes different treatments to be presented to different people in a way that the subjects cannot control. We talked about three such experiments in class: measuring the effect of ad wear-out (Lewis et al. 2011), the effect of Yelp ratings on restaurant revenue (Luca 2011), and the effect that gamification and badges have on user behavior in online communities (Oktay et al. 2010).

Overall, controlled experiments, observational studies and natural experiments are complementary approaches to studying human behavior.

Mason and Suri, “Conducting Behavioral Research on Amazon’s Mechanical Turk”

Lewis et al., “Here, There, and Everywhere: Correlated Online Behaviors Can Lead to Overestimates of the Effects of Advertising”

Henrich et al., “The WEIRDest people in the world?”

Mike Luca, “Reviews, Reputation, and Revenue: The Case of

Oktay et al., “Causal Discovery in Social Media Using Quasi-Experimental Designs”

Lecture 9: Data Wrangling

This is a guest lecture and post by John Myles White.

In this lecture, we talked about methods for getting data. We ranked methods in terms of their ease of use. The easiest method is a bulk download of an entire data set. We noted that there are several clearinghouses that link to many publicly available data sets, including data from Wikipedia, IMDB, and others. When bulk downloads are not available, many web sites (e.g., NYTimes, Twitter, Google) offer API access with which you can download chunks of data at a time and slowly accumulate a large body of data. When even this is not possible, one can scrape data from sites using tools such as BeautifulSoup or Nokogiri, so long as the Terms of Service allow automated access to the site.
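
The "download chunks and slowly accumulate" pattern for APIs can be sketched as a simple paging loop. Everything here is hypothetical: `fetch_page` is a stand-in that serves records from an in-memory list, where a real client would issue an HTTP request to the provider's documented endpoint, and the offset/limit paging scheme is an assumption that varies from API to API.

```python
import time

def fetch_page(offset, limit):
    """Stand-in for a real API call: returns up to `limit` records
    starting at `offset` from a small fake data set."""
    fake_dataset = [{"id": i} for i in range(25)]
    return fake_dataset[offset:offset + limit]

def download_all(fetch, page_size=10, pause_seconds=0.0):
    """Request one chunk at a time until the API returns an empty page."""
    records = []
    offset = 0
    while True:
        page = fetch(offset, page_size)
        if not page:
            break
        records.extend(page)
        offset += len(page)
        time.sleep(pause_seconds)  # pause between requests to respect rate limits
    return records

records = download_all(fetch_page)
print(len(records))
```

In practice the pause would be set according to the provider's published rate limits, and partial results would be written to disk so that a long download can resume after a failure.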

In the second half of the lecture, we discussed how to work with the data that one acquires from websites using any of the three methods above. This data is often structured in formats like JSON and XML, which should be parsed with the formal parsing libraries available in many popular languages such as Python. Sometimes the data is unstructured and we simply want to extract basic information like phone numbers; we described regular expressions as a mechanism for extracting this information and worked through an extended example of building a regular expression that matches phone numbers.
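
A small sketch combining the two ideas: parse a JSON record with Python's standard `json` module, then pull phone numbers out of a free-text field with a regular expression. The record and the pattern are illustrative assumptions, not the example built in class; the pattern only handles ten-digit North American numbers with a few common separators.

```python
import json
import re

# Hypothetical record, as it might arrive from an API.
raw = '{"name": "Example Cafe", "contact": "Call (212) 555-0123 or 212.555.0199 for reservations."}'
record = json.loads(raw)

PHONE_RE = re.compile(r"""
    \(?(\d{3})\)?   # area code, optionally wrapped in parentheses
    [\s.-]?         # optional separator: space, dot, or dash
    (\d{3})         # exchange
    [\s.-]?         # optional separator
    (\d{4})         # line number
""", re.VERBOSE)

# Normalize every match to the form AAA-EEE-NNNN.
phones = ["-".join(m.groups()) for m in PHONE_RE.finditer(record["contact"])]
print(record["name"], phones)
```

A pattern this loose will also match ten-digit runs that are not phone numbers, so real data usually calls for tightening it step by step against examples, much as the lecture's extended example did.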

Homework 2

The second homework is posted.

The first problem is a simple word count exercise over the Wikipedia corpus, the second examines Wikipedia page popularity, and the third explores tie strength between co-authors.

See Amazon’s getting started videos and the references from lectures 5 and 6 for more information on Pig, EC2, and Elastic MapReduce.

Some tips:

  1. Use the template solution files to test and debug Pig scripts on your local machine.

  2. Create a bucket with a unique name (e.g., your UNI) using the S3 console.

  3. Upload your locally tested Pig script to S3.

  4. Create a Pig job flow in the Elastic MapReduce console.

  5. Specify the path to your Pig script on S3, along with input and output paths.

  6. Select the number of instances (5 small instances should be sufficient).

  7. Specify a log path for debugging and a keypair if you’d like to log into the cluster while the job is running.

  8. To avoid an error in allocating heap space for Java when the job starts, select the “Memory Intensive Configuration” bootstrap script.

  9. Review job details and submit the job.

  10. Monitor the job status through the Elastic MapReduce console, or log into the master node with ssh (or PuTTY) and check the JobTracker with lynx:

    $ ssh -i /path/to/keypair.pem hadoop@<master-public-dns>
    $ lynx http://localhost:9100