These two lectures introduced network theory, following the first part of “Networks, Crowds, and Markets” by Easley & Kleinberg. We discussed basic representations of relational data, basic algorithms for computations on graphs, and common features of real-world networks.
Slides are below. See “Predicting Individual Behavior with Social Networks” and “The Structure of Online Diffusion Networks” for more details on the case studies we covered, and the references that follow for information on using Pig and Amazon’s computing services.
- Hadoop is an open source framework for scalable storage and computing, including a distributed filesystem and scalable implementation of MapReduce.
- Pig is a high-level language that converts sequences of common data analysis operations (e.g., group-bys, joins, etc.) to chains of MapReduce jobs and executes these either locally or across a Hadoop cluster.
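As a concrete illustration of the kind of operation Pig compiles to MapReduce, here is a minimal sketch of a group-by-and-count script; the file name and schema are assumptions for the example:

```pig
-- Hypothetical input: a tab-separated file of (user, page) visit records.
visits  = LOAD 'visits.tsv' AS (user:chararray, page:chararray);
-- Group by page and count visitors; Pig turns this into a MapReduce job.
by_page = GROUP visits BY page;
counts  = FOREACH by_page GENERATE group AS page, COUNT(visits) AS n;
STORE counts INTO 'page_counts';
```

In local mode the same script runs against files on the local filesystem; on a Hadoop cluster the paths refer to HDFS.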
- Pig is easy to download and try on Linux, Mac OS X, or Windows (requires Java). See RELEASE_NOTES.txt for information on trying Pig in local mode.
- The data types in Pig differ slightly from those in most languages, and include bags, tuples, and fields.
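To make these types concrete, the sketch below (with an assumed input file and schema) shows how grouping produces a relation whose rows contain a bag of tuples, which `DESCRIBE` reports as a nested schema:

```pig
-- Hypothetical edge list: each row is a (src, dst) tuple of two fields.
edges   = LOAD 'edges.tsv' AS (src:chararray, dst:chararray);
-- Grouping yields one row per src, holding a bag of the matching tuples.
grouped = GROUP edges BY src;
-- Prints the schema, e.g.:
--   grouped: {group: chararray, edges: {(src: chararray, dst: chararray)}}
DESCRIBE grouped;
```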
- The diagnostic operators in Pig are useful for understanding code and debugging. In particular, the DUMP operator shows the contents of a relation and the ILLUSTRATE operator shows how relations are transformed through a Pig program.
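For example, in a short out-degree computation (input file and schema are assumptions here), the two operators can be dropped in at any point in the pipeline:

```pig
edges      = LOAD 'edges.tsv' AS (src:chararray, dst:chararray);
out_degree = FOREACH (GROUP edges BY src) GENERATE group AS src, COUNT(edges) AS deg;
DUMP out_degree;        -- prints the contents of the relation to the console
ILLUSTRATE out_degree;  -- shows sample rows at each step of the pipeline
```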
- Programming Pig is a freely available book with an accompanying GitHub page of examples.
- In addition to running Hadoop and Pig locally, Amazon’s Elastic Compute Cloud (EC2) and Simple Storage Service (S3) provide pay-per-use computation and storage that can be used to construct rentable clusters for large-scale data analysis.
- Elastic MapReduce (EMR) simplifies this process, allowing one to submit MapReduce jobs without configuring low-level cluster details.
- Amazon’s getting started guide walks through the details of creating an account and launching and logging in to a simple Linux instance. Be sure to create a keypair for this last step with the AWS console. You can use the ssh command to log in to machines from Linux, Mac OS X, or Cygwin on Windows; alternatively, you can use the AWS console or PuTTY to access remote machines.
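The login command typically looks like the following sketch; the keypair file name, user name, and hostname are placeholders, with the real values coming from your AWS console:

```shell
# Hypothetical keypair and instance DNS name from the AWS console.
ssh -i my-keypair.pem ec2-user@ec2-XX-XX-XX-XX.compute-1.amazonaws.com
```

The keypair file must be readable only by you (`chmod 400 my-keypair.pem`) or ssh will refuse to use it.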
- The process for running a Hadoop/Pig job is similar, using the EMR console. Alternatively, you can run an interactive session and log in to the cluster rather than submitting a single job. More details on interactive sessions and individual job flows are available in Amazon’s tutorial articles.
- Amazon also provides a number of public data sets for analysis on EMR, including a copy of the Common Crawl Corpus of several billion web pages.