2013-04-05-AWS

1 Amazon Web Services    slide

2 Grant    slide

  • $3900 for this class
  • Store data in S3
  • Use mrjob on EMR

2.1 No crazy    notes

  • Overcharges go on my credit card!
  • Ask me before doing anything not discussed here

3 S3    slide

  • S3: Simple Storage Service
  • Can store and share large data files
  • Free bandwidth when processing with EMR

4 s3cmd    slide

  • Upload / Download files
  • s3cmd --configure
  • s3cmd mb s3://i290-group-name
  • s3cmd put localfile s3://i290-group-name/data/

4.1 Configuration    notes

  • check ~jblomo/mrjob.conf

5 Elastic MapReduce    slide

  • python job.py -r emr -c ~jblomo/mrjob.conf s3://i290-group-name/data/file
  • -r emr : run on EMR instead of localhost
  • -c ~jblomo/mrjob.conf : use a configuration file
    • The configuration requests 5 machines; this can be changed with command-line options
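
To launch the same job from a script rather than by hand, the command line above could be assembled roughly as follows. This is only a sketch: the helper name emr_command is hypothetical, and the optional --output-dir / --no-output flags are the ones used in the full example later in these slides.

```python
def emr_command(script, input_uri, conf_path, output_dir=None):
    """Build the argv list for running an mrjob script on EMR.

    emr_command is a hypothetical helper, not part of mrjob.
    """
    # -r emr: run on EMR instead of localhost
    # -c:     use the given mrjob configuration file
    argv = ["python", script, "-r", "emr", "-c", conf_path]
    if output_dir:
        # Keep results in S3 instead of streaming them back to the terminal.
        argv += ["--output-dir", output_dir, "--no-output"]
    # The input location goes last, as in the command shown on the slide.
    argv.append(input_uri)
    return argv
```

The resulting list could be handed to subprocess.run(argv, check=True). Note that the shell expands a path like ~jblomo/mrjob.conf, so when launching from Python it may need os.path.expanduser first.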

6 Copying Keys    slide

  • SSH keys are used to connect to the server to check status
  • Copy my keys, set correct permissions
~$ cp ~jblomo/mrjob-common.conf ~/mrjob.conf
~$ chmod 0600 ~/mrjob.conf

7 All Together Now    slide

~$ s3cmd put yelp_academic_dataset.json.gz s3://i290-jblomo/data/
# yelp_academic_dataset.json.gz -> s3://i290-jblomo/data/yelp_academic_dataset.json.gz  [1 of 1]
# 127506871 of 127506871   100% in    8s    13.60 MB/s  done

~$ cd datamining290/code/
~/datamining290/code$ python unique_review.py -v -r emr -c ~jblomo/mrjob.conf --output-dir s3://i290-jblomo/output/unique_review/ --no-output s3://i290-jblomo/data/yelp_academic_dataset.json.gz
# ...
# Creating Elastic MapReduce job flow
# ...
# Job flow created with ID: j-EFG48CIR1APW
# ...
# Job launched 60.8s ago, status STARTING: Provisioning Amazon EC2 capacity
# ...
# Job launched 334.4s ago, status RUNNING: Running step (unique_review.jblomo.20130406.171258.267523: Step 1 of 3)
# ...
# map 73% reduce  42%
# ...
# Counters from step 1:
# ...

~/datamining290/code$ s3cmd ls -r s3://i290-jblomo/output/
# ...
# 2013-04-06 18:05        29   s3://i290-jblomo/output/unique_review/part-00005
# ...
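
The listing above shows the job output spread across part-NNNNN files. Once those files have been downloaded to a local directory, they could be stitched back together with something like the sketch below. The function name merge_parts and both paths are hypothetical.

```python
import glob
import os
import shutil


def merge_parts(output_dir, merged_path):
    """Concatenate the part-* files in output_dir into merged_path.

    Files are merged in sorted name order. Returns the list of merged files.
    merged_path should live outside output_dir (or not start with "part-").
    """
    parts = sorted(glob.glob(os.path.join(output_dir, "part-*")))
    with open(merged_path, "wb") as merged:
        for part in parts:
            with open(part, "rb") as f:
                # Stream each part so large files are not read into memory.
                shutil.copyfileobj(f, merged)
    return parts
```

For example, merge_parts("unique_review", "unique_review.txt") would combine a locally downloaded copy of the output directory into a single file.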

7.1 Trade-offs    notes

  • Jobs may take 5 minutes to spin up
  • Errors are harder to debug because they are mixed in with Hadoop and EMR errors
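
Given those trade-offs, it can help to shake out bugs on a small local sample before paying for EMR startup time. A minimal sketch for cutting a sample from the gzipped dataset, assuming it holds one JSON record per line (the function name write_sample and the output filename are hypothetical):

```python
import gzip
import itertools


def write_sample(src_path, dst_path, max_lines=1000):
    """Copy the first max_lines lines of a gzipped text file to dst_path.

    Returns the number of lines written.
    """
    written = 0
    # "rt" decompresses and decodes the input as text, one record per line.
    with gzip.open(src_path, "rt", encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in itertools.islice(src, max_lines):
            dst.write(line)
            written += 1
    return written
```

After write_sample("yelp_academic_dataset.json.gz", "sample.json"), the job can be run on sample.json without -r emr so that it executes on localhost.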

8 Extra Notes    slide

  • Cannot overwrite output directory: choose a new one for each run
  • Errors may be hard to debug: run locally on a sample of your data first
  • Job output is split across multiple files; recall the Hadoop video lecture
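
Since an existing output directory cannot be reused, one option is to derive the directory name from the current time. A sketch under assumptions: the function name fresh_output_uri and the output/ prefix layout are illustrative, modeled on the paths used in the example above.

```python
from datetime import datetime, timezone


def fresh_output_uri(bucket, job_name, now=None):
    """Return an S3 output URI with a UTC timestamp suffix.

    The URI is unique as long as two runs do not start in the same second.
    """
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y%m%d-%H%M%S")
    return "s3://{}/output/{}-{}/".format(bucket, job_name, stamp)
```

The returned value can be passed to --output-dir, so each run writes to its own directory.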

Date: 2013-04-06 11:19:43 PDT

Author: Jim Blomo

Org version 7.8.02 with Emacs version 23
