Homework 2

The second homework is posted.

The first problem is a simple word count exercise over the Wikipedia corpus, the second examines Wikipedia page popularity, and the third explores tie strength between co-authors.
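For readers new to the idea, the word count computation itself can be sketched in a few lines of plain Python before writing it in Pig. This is only an illustration on a tiny in-memory sample, not the homework solution: the function name `word_count`, the sample lines, and the tokenization rule (lowercasing and splitting on non-word characters) are all assumptions, and the assignment may define words differently.

```python
import re
from collections import Counter


def word_count(lines):
    """Count lowercase word tokens across an iterable of text lines."""
    counts = Counter()
    for line in lines:
        # Assumed tokenization: lowercase, then take runs of word characters.
        counts.update(re.findall(r"\w+", line.lower()))
    return counts


# Hypothetical sample standing in for lines of the corpus.
sample = ["the quick brown fox", "The lazy dog and the fox"]
counts = word_count(sample)

# Print the three most frequent words with their counts.
for word, n in counts.most_common(3):
    print(word, n)
```

A Pig script expresses the same steps (tokenize, group by word, count each group) so that they can run in parallel over the full corpus.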

See Amazon’s getting started videos and the references from lectures 5 and 6 for more information on Pig, EC2, and Elastic MapReduce.

Some tips:

  1. Use the template solution files to test and debug Pig scripts on your local machine.

  2. Create a bucket with a unique name (e.g., your UNI) using the S3 console:
    Step 1

  3. Upload your locally tested Pig script to S3:
    Step 2

  4. Create a Pig job flow in the Elastic MapReduce console:
    Step 3

  5. Specify the path to your Pig script on S3, along with input and output paths:
    Step 4

  6. Select the number of instances (5 small instances should be sufficient):
    Step 5

  7. Specify a log path for debugging and a keypair if you’d like to log into the cluster while the job is running:
    Step 6

  8. To avoid an error in allocating heap space for Java when the job starts, select the “Memory Intensive Configuration” bootstrap script:
    Step 7

  9. Review job details and submit the job:
    Step 8

  10. Monitor the job status through the Elastic MapReduce console, or log into the cluster's master node with ssh (or PuTTY) and check the JobTracker with lynx:

    $ ssh -i /path/to/keypair.pem hadoop@ec2-xxx.compute-1.amazonaws.com
    $ lynx http://localhost:9100

Homework 1

The first homework is posted.

The first problem looks at the impact of inventory size on customer satisfaction for the MovieLens data, the second is an exercise in simple streaming calculations, and the third explores various counting scenarios.

A script to download the data for the first question as well as a solution template for the second are available on the course GitHub page.
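As a rough illustration of what a streaming calculation looks like, the sketch below keeps a running mean and sample variance while seeing each value only once (Welford's method). It is an assumed example, not the course's solution template: the class name `RunningStats` and the input values are hypothetical, and the quantities the homework actually asks for may differ.

```python
class RunningStats:
    """Running mean and sample variance over a stream, one value at a time."""

    def __init__(self):
        self.n = 0        # number of values seen so far
        self.mean = 0.0   # running mean
        self._m2 = 0.0    # running sum of squared deviations from the mean

    def update(self, x):
        # Welford's update: constant memory, no second pass over the data.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Sample variance; defined as 0.0 here when fewer than two values were seen.
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0


stats = RunningStats()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:  # hypothetical stream
    stats.update(value)

print(stats.n, stats.mean, stats.variance)
```

The point of the design is that only three numbers are stored no matter how long the stream is, which is what distinguishes a streaming calculation from one that loads all of the data first.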