2013-04-05-AWS
1 Amazon Web Services
2 Grant
- $3900 for this class
- Store data in S3
- Use mrjob on EMR
2.1 No crazy notes
- Overcharges go on my credit card!
- Ask me before doing anything not discussed here
3 S3
- S3: Simple Storage Service
- Can store and share large data files
- Free bandwidth when processing with EMR
4 s3cmd
- Upload / Download files
s3cmd --configure
s3cmd mb s3://i290-group-name
s3cmd put localfile s3://i290-group-name/data/
4.1 Configuration notes
- Check ~jblomo/mrjob.conf
5 Elastic MapReduce
python job.py -r emr -c ~jblomo/mrjob.conf s3://i290-group-name/data/file
- -r emr: run on EMR instead of localhost
- -c ~jblomo/mrjob.conf: use a configuration file
- 5 machines, can change with command line options
6 Copying Keys
- SSH keys are used to connect to the server to check status
- Copy my keys, set correct permissions
~$ cp ~jblomo/mrjob-common.conf ~/mrjob.conf
~$ chmod 0600 ~/mrjob.conf
7 All Together Now
~$ s3cmd put yelp_academic_dataset.json.gz s3://i290-jblomo/data/
# yelp_academic_dataset.json.gz -> s3://i290-jblomo/data/yelp_academic_dataset.json.gz [1 of 1]
# 127506871 of 127506871 100% in 8s 13.60 MB/s done
~$ cd datamining290/code/
~/datamining290/code$ python unique_review.py -v -r emr -c ~jblomo/mrjob.conf --output-dir s3://i290-jblomo/output/unique_review/ --no-output s3://i290-jblomo/data/yelp_academic_dataset.json.gz
# ...
# Creating Elastic MapReduce job flow
# ...
# Job flow created with ID: j-EFG48CIR1APW
# ...
# Job launched 60.8s ago, status STARTING: Provisioning Amazon EC2 capacity
# ...
# Job launched 334.4s ago, status RUNNING: Running step (unique_review.jblomo.20130406.171258.267523: Step 1 of 3)
# ...
# map 73% reduce 42%
# ...
# Counters from step 1:
# ...
~/datamining290/code$ s3cmd ls -r s3://i290-jblomo/output/
# ...
# 2013-04-06 18:05 29 s3://i290-jblomo/output/unique_review/part-00005
# ...
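The session above passes a fixed --output-dir. Hadoop refuses to write into an output directory that already exists, so a rerun needs a fresh location. A minimal sketch of one way to get one is to append a UTC timestamp to the path. The helper name unique_output_dir and the output/<job>/<timestamp>/ layout are assumptions for illustration, not part of mrjob or s3cmd.

```python
from datetime import datetime, timezone


def unique_output_dir(bucket, job_name, now=None):
    """Return an S3 URI ending in a UTC timestamp, so reruns do not collide.

    Hypothetical helper: the path layout is only a suggestion.
    """
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y%m%d-%H%M%S")
    return "s3://{}/output/{}/{}/".format(bucket, job_name, stamp)


if __name__ == "__main__":
    # Paste the printed URI after --output-dir when launching the job.
    print(unique_output_dir("i290-group-name", "unique_review"))
```

Two runs started within the same second would still collide, which is unlikely to matter for hand-launched jobs.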
7.1 Trade-offs notes
- Jobs may take 5 minutes to spin up
- Errors are harder to debug because they are mixed in with Hadoop and EMR errors
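One way to limit both costs is to debug on localhost against a small sample before paying the EMR start-up time. The sketch below copies the first lines of a gzipped input into a plain-text file. It assumes the input holds one record per line. The function name write_sample and the default of 1000 lines are illustrative choices.

```python
import gzip
from itertools import islice


def write_sample(src_path, dst_path, n_lines=1000):
    """Copy the first n_lines lines of a gzipped text file to a plain-text file.

    Returns the number of lines actually written (fewer if the input is short).
    """
    written = 0
    with gzip.open(src_path, "rt", encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        # islice stops after n_lines without decompressing the rest of the file.
        for line in islice(src, n_lines):
            dst.write(line)
            written += 1
    return written
```

For example, write_sample("yelp_academic_dataset.json.gz", "sample.json") produces a file that can be passed to the job without -r emr. The first lines of a file are not a random sample, so records that only appear later in the data can still cause surprises on the full run.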
8 Extra Notes
- Cannot overwrite output directory: choose a new one for each run
- Errors may be hard to debug. Run locally with a sample of your data
- Output from the job will be split into multiple files (such as part-00005); recall the Hadoop video lecture
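If a single result file is more convenient, the part files can be concatenated after downloading them. This is a rough sketch that assumes the parts have already been fetched into a local directory (for instance with a recursive s3cmd get) and are named part-NNNNN. The function name merge_parts is hypothetical.

```python
import glob
import os
import shutil


def merge_parts(parts_dir, merged_path):
    """Concatenate part-* files from parts_dir, in file-name order.

    Returns the number of part files merged. merged_path should not itself
    match part-* inside parts_dir.
    """
    part_paths = sorted(glob.glob(os.path.join(parts_dir, "part-*")))
    with open(merged_path, "wb") as merged:
        for path in part_paths:
            with open(path, "rb") as part:
                # Stream the copy so large parts are not loaded into memory.
                shutil.copyfileobj(part, merged)
    return len(part_paths)
```

The merged file is ordered by part name only. Keys are distributed across reducers, so the combined output is generally not sorted as a whole.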