Homework 2

The second homework is posted.

The first problem is a simple word count exercise over the Wikipedia corpus, the second examines Wikipedia page popularity, and the third explores tie strength between co-authors.
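
For the first problem, the core pattern is the classic word count. Here is a minimal sketch in Pig Latin, assuming plain whitespace-delimited text input; the paths and relation names below are placeholders, not the ones used in the template files:

    -- load raw text, one line per record (placeholder path)
    lines   = LOAD 'input/articles.txt' AS (line:chararray);
    -- split each line into words and flatten to one word per record
    words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
    -- group identical words and count each group
    grouped = GROUP words BY word;
    counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS cnt;
    -- write (word, count) pairs (placeholder path)
    STORE counts INTO 'output/wordcount';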

See Amazon’s getting started videos and the references from lectures 5 and 6 for more information on Pig, EC2, and Elastic MapReduce.

Some tips:

  1. Use the template solution files to test and debug Pig scripts on your local machine (see the local-mode example after this list).

  2. Create a bucket with a unique name (e.g., your UNI) using the S3 console:
    (screenshot: Step 1)

  3. Upload your locally tested Pig script to S3:
    (screenshot: Step 2)

  4. Create a Pig job flow in the Elastic MapReduce console:
    (screenshot: Step 3)

  5. Specify the path to your Pig script on S3, along with input and output paths:
    (screenshot: Step 4)

  6. Select the number of instances (5 small instances should be sufficient):
    (screenshot: Step 5)

  7. Specify a log path for debugging and a keypair if you’d like to log into the cluster while the job is running:
    (screenshot: Step 6)

  8. To avoid a Java heap-space error when the job starts, select the “Memory Intensive Configuration” bootstrap script:
    (screenshot: Step 7)

  9. Review job details and submit the job:
    (screenshot: Step 8)

  10. Monitor the job status through the Elastic MapReduce console, or log into the master node with ssh (or PuTTY) and check the JobTracker with lynx:

    # from your local machine (the host name is the master node's public DNS)
    $ ssh -i /path/to/keypair.pem hadoop@ec2-xxx.compute-1.amazonaws.com
    # once logged in, browse the JobTracker UI with lynx
    $ lynx http://localhost:9100
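
As noted in tip 1, test the script locally before uploading it to S3. Assuming Pig is installed on your machine and your script is named wordcount.pig (an illustrative name, not the template's), a local run looks like this:

    # run in Pig's local mode against files on your local disk
    $ pig -x local wordcount.pig

    # if the template script reads its paths from parameters, pass them explicitly
    # (sample.txt and out are placeholder paths)
    $ pig -x local -param INPUT=sample.txt -param OUTPUT=out wordcount.pig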