2013-04-05-AWS

1 Amazon Web Services
2 Grant
- 2.1 No crazy
3 S3
4 s3cmd
- 4.1 Configuration
5 Elastic MapReduce
6 Copying Keys
7 All Together Now
- 7.1 Trade-offs
8 Extra Notes

1 Amazon Web Services slide

2 Grant slide

$3900 for this class
Store data in S3
Use mrjob on EMR

2.1 No crazy notes

Overcharges go on my credit card!
Ask me before doing anything not discussed here

3 S3 slide

S3: Simple Storage Service
Can store and share large data files
Free bandwidth when processing with EMR

4 `s3cmd` slide

Upload / Download files
s3cmd --configure
s3cmd mb s3://i290-group-name
s3cmd put localfile s3://i290-group-name/data/

4.1 Configuration notes

Check ~jblomo/mrjob.conf

5 Elastic MapReduce slide

python job.py -r emr -c ~jblomo/mrjob.conf s3://i290-group-name/data/file
-r emr : run on EMR instead of localhost
-c ~jblomo/mrjob.conf : use a configuration file
- 5 machines by default; change with command-line options

6 Copying Keys slide

SSH keys are used to connect to the server to check job status
Copy my keys, set correct permissions

~$ cp ~jblomo/mrjob-common.conf ~/mrjob.conf
~$ chmod 0600 ~/mrjob.conf

7 All Together Now slide

~$ s3cmd put yelp_academic_dataset.json.gz s3://i290-jblomo/data/
# yelp_academic_dataset.json.gz -> s3://i290-jblomo/data/yelp_academic_dataset.json.gz  [1 of 1]
# 127506871 of 127506871   100% in    8s    13.60 MB/s  done

~$ cd datamining290/code/
~/datamining290/code$ python unique_review.py -v -r emr -c ~jblomo/mrjob.conf --output-dir s3://i290-jblomo/output/unique_review/ --no-output s3://i290-jblomo/data/yelp_academic_dataset.json.gz
# ...
# Creating Elastic MapReduce job flow
# ...
# Job flow created with ID: j-EFG48CIR1APW
# ...
# Job launched 60.8s ago, status STARTING: Provisioning Amazon EC2 capacity
# ...
# Job launched 334.4s ago, status RUNNING: Running step (unique_review.jblomo.20130406.171258.267523: Step 1 of 3)
# ...
# map 73% reduce  42%
# ...
# Counters from step 1:
# ...

~/datamining290/code$ s3cmd ls -r s3://i290-jblomo/output/
# ...
# 2013-04-06 18:05        29   s3://i290-jblomo/output/unique_review/part-00005
# ...
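
The listing above shows the job output arriving as several part files. Once they have been downloaded to a local directory (for example with `s3cmd get --recursive`), they can be joined into one file. A minimal sketch follows; the function name `merge_parts` and both path arguments are hypothetical and not part of the class code.

```python
import glob
import os


def merge_parts(parts_dir, merged_path):
    """Concatenate every part-* file in parts_dir into merged_path.

    Files are joined in sorted name order (part-00000, part-00001, ...).
    Returns the list of part files that were merged.
    """
    part_paths = sorted(glob.glob(os.path.join(parts_dir, "part-*")))
    with open(merged_path, "wb") as merged:
        for path in part_paths:
            with open(path, "rb") as part:
                merged.write(part.read())
    return part_paths
```

Write the merged file under a name that does not start with `part-`, so a second run does not pick it up as input.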

7.1 Trade-offs notes

Jobs may take 5 minutes to spin up
Errors are harder to debug because they are mixed in with Hadoop and EMR errors
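
One way to reduce the debugging cost is to test on a small local sample before launching on EMR. The sketch below copies the first lines of a gzipped input file into a smaller gzipped file. It assumes one record per line; the function name `write_sample` and the default of 1000 lines are illustrative choices, not part of the class code.

```python
import gzip
import itertools


def write_sample(source_path, sample_path, num_lines=1000):
    """Copy the first num_lines lines of a gzipped file to a new gzipped file.

    Returns the number of lines actually written, which is smaller than
    num_lines when the source file is shorter.
    """
    count = 0
    with gzip.open(source_path, "rb") as source, \
            gzip.open(sample_path, "wb") as sample:
        for line in itertools.islice(source, num_lines):
            sample.write(line)
            count += 1
    return count
```

This takes the head of the file, not a random sample, so it may miss records that only appear later in the data. The sample can then be passed to the job without `-r emr`, so it runs on localhost.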

8 Extra Notes slide

Cannot overwrite output directory: choose a new one for each run
Errors may be hard to debug: run locally with a sample of your data first
Output from the job will be split into multiple files; recall the Hadoop video lecture
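
Since an existing output directory cannot be overwritten, one option is to put a timestamp in the path so every run gets a new one. A minimal sketch; the function name `fresh_output_dir` and the `output/<job>/<timestamp>/` layout are assumptions for illustration, not a convention required by mrjob or EMR.

```python
from datetime import datetime, timezone


def fresh_output_dir(bucket, job_name, now=None):
    """Return an S3 URI containing a UTC timestamp, unique per second.

    `now` can be supplied for reproducible paths; otherwise the current
    UTC time is used.
    """
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y%m%d.%H%M%S")
    return "s3://{}/output/{}/{}/".format(bucket, job_name, stamp)


if __name__ == "__main__":
    # Print a path to paste after --output-dir on the command line.
    print(fresh_output_dir("i290-group-name", "unique_review"))
```

The printed value is meant to be used as the `--output-dir` argument when launching the job.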
