The second homework is posted.
The first problem is a simple word count exercise over the Wikipedia corpus, the second examines Wikipedia page popularity, and the third explores tie strength between co-authors.
See Amazon’s getting started videos and the references from lectures 5 and 6 for more information on Pig, EC2, and Elastic MapReduce.
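For the word count problem, it can help to have an independent count to compare against while debugging a Pig script locally. The following is a minimal sketch using standard Unix tools and is not part of the template solution files: the function name `wordcount` is hypothetical, and the tokenization (lowercase everything, split on any non-alphanumeric character) is an assumption that may differ from what the homework specifies.

```shell
#!/bin/sh
# Reference word count for sanity-checking Pig output on a small sample.
# Assumption: words are maximal runs of letters and digits, compared
# case-insensitively. Adjust to match the tokenization the homework asks for.
wordcount() {
    tr '[:upper:]' '[:lower:]' |        # fold case
    tr -cs '[:alnum:]' '\n' |           # one token per line, squeezing separators
    grep -v '^$' |                      # drop the empty token left by leading separators
    LC_ALL=C sort | uniq -c |           # count identical tokens
    awk '{ print $2 "\t" $1 }' |        # emit word<TAB>count
    LC_ALL=C sort                       # deterministic output order
}
```

Piping a small sample file through `wordcount` and comparing the result with the output of a local-mode run of the Pig script (`pig -x local`) gives a quick check before anything is uploaded to S3.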
Some tips:
- Use the template solution files to test and debug Pig scripts on your local machine.
- Create a bucket with a unique name (e.g., your UNI) using the S3 console.
- Upload your locally tested Pig script to S3.
- Create a Pig job flow in the Elastic MapReduce console.
- Specify the path to your Pig script on S3, along with input and output paths.
- Select the number of instances (5 small instances should be sufficient).
- Specify a log path for debugging and a keypair if you’d like to log into the cluster while the job is running.
- To avoid an error in allocating heap space for Java when the job starts, select the “Memory Intensive Configuration” bootstrap script.
- Review job details and submit the job.
- Monitor the job status through the Elastic MapReduce console, or log into the master node with ssh (or PuTTY) and check the JobTracker with lynx:
$ ssh -i /path/to/keypair.pem hadoop@ec2-xxx.compute-1.amazonaws.com
$ lynx http://localhost:9100
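
As an alternative to lynx, the JobTracker page can be viewed in a local browser by forwarding port 9100 over ssh. The sketch below assumes the same keypair and `hadoop` user as the command above. The function name `jobtracker_tunnel` and the `DRY_RUN` switch are hypothetical helpers introduced for illustration, and the host name is whatever public DNS name the Elastic MapReduce console reports for your cluster.

```shell
#!/bin/sh
# Forward local port 9100 to the JobTracker web UI on the cluster.
# Usage: jobtracker_tunnel /path/to/keypair.pem <public-dns-of-cluster>
# Set DRY_RUN=1 to print the ssh command instead of running it.
jobtracker_tunnel() {
    keypair=$1
    master=$2
    # -N: run no remote command, only forward ports
    # -L: forward local port 9100 to localhost:9100 as seen from the remote host
    set -- ssh -i "$keypair" -N -L 9100:localhost:9100 "hadoop@$master"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}
```

While the tunnel is open, pointing a local browser at http://localhost:9100 shows the same page that lynx would display on the cluster.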