Author Archives: Sharad Goel

Lectures 5 & 6: Networks

These two lectures introduced network theory, following the first part of “Networks, Crowds, and Markets” by Easley & Kleinberg. We discussed basic representations of relational data, basic algorithms for computations on graphs, and common features of real-world networks.
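
To make the ideas of graph representations and basic graph algorithms concrete, here is a minimal sketch in Python. It is not taken from the lecture slides: the function names `build_adjacency` and `bfs_distances` and the toy edge list are made up for illustration. The sketch stores an undirected network as an adjacency list and uses breadth-first search to compute hop distances from one node.

```python
from collections import defaultdict, deque


def build_adjacency(edges):
    """Convert an undirected edge list into an adjacency-list dictionary."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def bfs_distances(adj, source):
    """Return the number of hops from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adj[node]:
            if neighbor not in dist:  # first visit gives the shortest hop count
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


# Hypothetical toy network, used only for illustration.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "e")]
adj = build_adjacency(edges)
distances = bfs_distances(adj, "a")
```

An adjacency list keeps memory proportional to the number of edges, which tends to suit sparse real-world networks better than a full adjacency matrix.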

Slides are below. See “Predicting Individual Behavior with Social Networks” and “The Structure of Online Diffusion Networks” for more details of the case studies we covered, and the references that follow for information on using Pig and Amazon’s computing services.


  • Hadoop is an open-source framework for scalable storage and computing, including a distributed filesystem and a scalable implementation of MapReduce.
  • Pig is a high-level language that converts sequences of common data analysis operations (e.g., group-bys, joins, etc.) to chains of MapReduce jobs and executes these either locally or across a Hadoop cluster.
  • Pig is easy to download and try on Linux, Mac OS X, or Windows (requires Java). See RELEASE_NOTES.txt for information on trying Pig in local mode.
  • Pig’s data types differ slightly from those in most languages and include bags, tuples, and fields.
  • The diagnostic operators in Pig are useful for understanding code and debugging. In particular, the DUMP operator shows the contents of a relation and the ILLUSTRATE operator shows how relations are transformed through a Pig program.
  • Programming Pig is a freely available book with an accompanying GitHub page of examples.
  • As an alternative to running Hadoop and Pig locally, you can use Amazon’s Elastic Compute Cloud (EC2) and Simple Storage Service (S3), which provide pay-per-use computation and storage, to construct rentable clusters for large-scale data analysis.
  • Elastic MapReduce (EMR) facilitates this process, allowing one to easily submit MapReduce jobs without needing to configure low-level cluster details.
  • Amazon’s getting started guide walks through the details of creating an account and launching and logging into a simple Linux instance. Be sure to create a key pair in the AWS console for this last step. You can use the ssh command to log in to machines on Linux, Mac OS X, or in Cygwin on Windows; alternatively, you can use the AWS console or PuTTY to access remote machines.
  • The process for running a Hadoop/Pig job is similar, using the EMR Console. Alternatively, you can run an interactive session and log into the cluster rather than submitting a single job. More details on interactive sessions and individual job flows are available through Amazon’s tutorial articles.
  • Amazon also provides a number of public data sets for analysis on EMR, including a copy of the Common Crawl Corpus of several billion web pages.
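
As a rough illustration of how a group-by can be expressed as a MapReduce job, the sketch below imitates the map, shuffle, and reduce phases in plain Python on a single machine. It does not use Pig or Hadoop, and every name in it (`map_phase`, `shuffle_phase`, `reduce_phase`, `emit_endpoints`, `sum_counts`) is hypothetical. The example counts the degree of each node in a toy edge list, which is roughly the kind of computation a Pig group-by followed by a count would produce.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record, yielding (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle_phase(pairs):
    """Collect all values that share a key, as happens between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Apply the reducer to each key and its list of grouped values."""
    return {key: reducer(key, values) for key, values in groups.items()}


def emit_endpoints(edge):
    """Mapper: emit a count of 1 for each endpoint of an undirected edge."""
    u, v = edge
    yield (u, 1)
    yield (v, 1)


def sum_counts(key, values):
    """Reducer: add up the counts collected for one node."""
    return sum(values)


# Hypothetical toy edge list, used only for illustration.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "e")]
pairs = map_phase(edges, emit_endpoints)
degrees = reduce_phase(shuffle_phase(pairs), sum_counts)
```

The intermediate result of `shuffle_phase` maps each key to a list of values, loosely analogous to the bag of tuples attached to each group in Pig. On a real cluster the shuffle moves data between machines rather than filling an in-memory dictionary.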