Lecture 10: Experimental Design

This lecture focused on the difficulty of experimental design. At first glance, experimental design seems easy: separate subjects into a treatment group and a control group, administer the treatment, and measure the effect. However, the devil lies in the details, and small changes in methodology can lead to subtle errors and ultimately to incorrect conclusions.

The first point to address is the splitting criterion. A famously incorrect way to measure advertising effectiveness is to look at the correlation between overall revenue and advertising spend. However, the two are inherently correlated, since one tends to spend more on advertising precisely when revenue is expected to be higher (for example, a ski shop advertising in late fall, or Black Friday sales the day after Thanksgiving). This fallacy is especially prevalent in online experiments, where the people who are more active online are demographically different from those who are less active. Just consider whether you use the Internet the same way as your parents or grandparents. The solution is to randomize across the population so that every subject has the same chance of being in each group.
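As a minimal sketch (illustrative only, not code from the lecture), here is randomized assignment in Python, where each subject is placed in either group with equal probability:

```python
import random

def assign_groups(subject_ids, seed=0):
    """Randomly assign each subject to 'treatment' or 'control' with probability 1/2."""
    rng = random.Random(seed)  # fixed seed only so the example is reproducible
    return {sid: rng.choice(["treatment", "control"]) for sid in subject_ids}

# Hypothetical subject identifiers.
assignment = assign_groups(["s1", "s2", "s3", "s4", "s5", "s6"])
print(assignment)
```

Because assignment does not depend on any property of the subjects, the two groups are comparable in expectation, which is exactly what the revenue-versus-ad-spend comparison lacks.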

Often explicit experiments are hard to perform, and a natural approach is to work with observational data. Here one has to worry about the subtle problem exemplified by Simpson's paradox. If the control/treatment decision is conditioned on a latent variable that is unknown to the experimenter, the experiment may lead to incorrect results. A famous example is the Berkeley gender discrimination lawsuit. The aggregate numbers showed that the university's admission rate for men was significantly higher than that for women. Closer examination showed that most departments actually had a slight bias towards women in their admissions, and the overall data was explained by the fact that women tended to apply to more competitive departments.
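To make the reversal concrete, here is a small worked example with hypothetical admission counts (not the actual Berkeley figures): each department admits women at a higher rate, yet the aggregate rate favors men because women disproportionately apply to the more competitive department.

```python
# Hypothetical admission counts, chosen only to illustrate Simpson's paradox.
# department -> {group: (admitted, applicants)}
data = {
    "dept_A": {"men": (480, 800), "women": (70, 100)},   # less competitive
    "dept_B": {"men": (20, 200),  "women": (180, 900)},  # more competitive
}

def rate(admitted, applicants):
    return admitted / applicants

# Per-department rates: women are admitted at a higher rate in *both* departments.
for dept, groups in data.items():
    for group, (admitted, applicants) in groups.items():
        print(f"{dept} {group}: {rate(admitted, applicants):.0%}")

# Aggregate rates: men come out ahead overall, because most women applied
# to the more competitive department (dept_B).
for group in ("men", "women"):
    admitted = sum(data[d][group][0] for d in data)
    applicants = sum(data[d][group][1] for d in data)
    print(f"overall {group}: {rate(admitted, applicants):.0%}")
```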

Controlled experiments don’t suffer from Simpson’s paradox, and they have many other advantages in the online setting. Online experiments can reach literally millions of people, and thus can be used to measure very small effects (Lewis et al. 2011). They can be relatively cheap to run with platforms like Amazon’s Mechanical Turk (Mason and Suri 2013), and they can be used to recruit diverse subjects, rather than the typical “undergraduates at a large midwestern university,” which can lead to drastically different conclusions (Henrich et al. 2010). The only major downside is that people may behave differently online than they do offline.

Where controlled experiments may seem contrived, and observational data leads to inconclusive results, natural experiments can help. In natural experiments some minor aspect of the system causes different treatments to be presented to different people, in a way that the subjects cannot control. We talked about three such experiments in class: measuring the effect of ad wear-out (Lewis et al. 2011), the effect of Yelp ratings on restaurant revenue (Luca 2011), and the effect that gamification and badges have on user behavior in online communities (Oktay et al. 2010).
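As a sketch of how such an analysis might look (with synthetic data and a hypothetical half-star rounding cutoff, not the actual analysis in Luca 2011): if displayed ratings are rounded, restaurants whose underlying averages fall just below versus just above a cutoff end up with different displayed ratings essentially by chance, so comparing their outcomes approximates a randomized comparison.

```python
import random

random.seed(0)

CUTOFF = 3.25     # hypothetical cutoff: underlying averages >= 3.25 display as 3.5 stars
BANDWIDTH = 0.05  # only compare restaurants with underlying averages near the cutoff

# Synthetic data: (underlying average rating, monthly revenue) for many restaurants.
restaurants = [(random.uniform(3.0, 3.5), random.gauss(50_000, 5_000))
               for _ in range(10_000)]

# Restaurants just below vs. just above the cutoff are similar in quality,
# but only the latter display the higher star rating.
below = [rev for avg, rev in restaurants if CUTOFF - BANDWIDTH <= avg < CUTOFF]
above = [rev for avg, rev in restaurants if CUTOFF <= avg < CUTOFF + BANDWIDTH]

estimate = sum(above) / len(above) - sum(below) / len(below)
# With purely synthetic revenue the estimate is near zero; on real data it would
# reflect the causal effect of the higher displayed rating.
print(f"estimated revenue effect of the higher displayed rating: {estimate:,.0f}")
```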

Overall, controlled experiments, observational studies and natural experiments are complementary approaches to studying human behavior.

Mason and Suri, “Conducting Behavioral Research on Amazon’s Mechanical Turk” http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1691163

Lewis et al., “Here, There, and Everywhere: Correlated Online Behaviors Can Lead to Overestimates of the Effects of Advertising” http://www2011india.com/proceeding/proceedings/p157.pdf

Henrich et al., “The WEIRDest people in the world?” http://www.econstor.eu/bitstream/10419/43616/1/626014360.pdf

Luca, “Reviews, Reputation, and Revenue: The Case of Yelp.com” http://ctct.wpengine.com/wp-content/uploads/2011/10/12-016.pdf

Oktay et al., “Causal Discovery in Social Media Using Quasi-Experimental Designs” http://people.cs.umass.edu/~hoktay/pub/soma2010.pdf