2013-02-15-mrjob
1 Lab: mrjob
- Understand review_word_count.py
- Find review with most unique words
  - Fill in unique_review.py
- Find similar users
  - Write user_similarity.py
1.1 mrjob notes
- Using the Yelp Academic Dataset
- In lecture, we covered the steps for finding the review with the most unique words
- Use Jaccard similarity for user_similarity
2 Data
- ischool: ~jblomo/yelp_academic_dataset.json
- May copy or use in place
2.1 Agreement notes
- Data set is for academic use only
- Yelp Dataset
3 Understand review_word_count.py
$ python review_word_count.py yelp_academic_dataset.json
no configs found; falling back on auto-configuration
creating tmp directory /tmp/review_word_count.jim.20130215.071901.095847
reading from file
> /home/jim/src/datamining290/code/venv/bin/python review_word_count.py --step-num=0 --mapper /tmp/review_word_count.jim.20130215.071901.095847/input_part-00000
writing to /tmp/review_word_count.jim.20130215.071901.095847/step-0-mapper_part-00000
Counters from step 1:
  (no counters found)
...
Streaming final output from /tmp/review_word_count.jim.20130215.071901.095847/output
"4"	2
"5"	1
"50"	1
"6"	2
"7"	2
"70s"	1
"9"	2
"a"	46
"abbey"	4
"able"	1
"about"	4
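The output above is a word-to-count table, which suggests a single map step and a single reduce step. The sketch below is not the lab's review_word_count.py; it is a plain-Python model of that kind of job, assuming each input line is a JSON object with a `type` field and, for reviews, a `text` field. `run_local` is a hypothetical stand-in for the framework's shuffle and sort. In mrjob itself, `mapper` and `reducer` would be methods on an `MRJob` subclass.

```python
import json
import re
from collections import defaultdict

# Words are runs of letters, digits, underscores, or apostrophes.
WORD_RE = re.compile(r"[\w']+")


def mapper(line):
    """Emit (word, 1) for every word in a review's text."""
    record = json.loads(line)
    # Assumption: non-review records (users, businesses) are skipped.
    if record.get("type") != "review":
        return
    for word in WORD_RE.findall(record["text"]):
        yield word.lower(), 1


def reducer(word, counts):
    """Sum the 1s emitted for a single word."""
    yield word, sum(counts)


def run_local(lines):
    """Hypothetical local driver: group mapper output by key, then reduce."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            grouped[key].append(value)
    results = {}
    for key in sorted(grouped):
        for out_key, out_value in reducer(key, grouped[key]):
            results[out_key] = out_value
    return results
```

Calling `run_local` on a handful of JSON lines returns a dict of word counts, with keys in the same sorted order as the streamed output above.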
4 Fill in unique_review.py
- Multi-step map reduce
- Steps are explained in lecture
- Skeleton in code
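The lecture steps are not reproduced in these notes, so the following is only one plausible way to chain the steps, not the lab skeleton. It assumes a "unique" word is one that appears in exactly one review, and that review records carry `type`, `text`, and `review_id` fields. `run_step` is a hypothetical helper that plays the role of the framework between steps. In mrjob the same mappers and reducers would be listed in the job's `steps()` method.

```python
import json
import re
from collections import defaultdict

WORD_RE = re.compile(r"[\w']+")


def run_step(records, mapper, reducer):
    """Hypothetical local driver for one map/reduce step."""
    grouped = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            grouped[key].append(value)
    output = []
    for key, values in grouped.items():
        output.extend(reducer(key, values))
    return output


# Step 1: find words that occur in exactly one review.
def words_mapper(line):
    review = json.loads(line)
    if review.get("type") != "review":
        return
    # A set, so a word repeated inside one review is emitted once.
    for word in {w.lower() for w in WORD_RE.findall(review["text"])}:
        yield word, review["review_id"]


def unique_word_reducer(word, review_ids):
    if len(review_ids) == 1:
        yield review_ids[0], 1


# Step 2: count unique words per review.
def identity_mapper(pair):
    yield pair


def count_reducer(review_id, ones):
    yield review_id, sum(ones)


# Step 3: send every count to a single key and keep the maximum.
def single_key_mapper(pair):
    review_id, count = pair
    yield None, (count, review_id)


def max_reducer(_, count_id_pairs):
    yield max(count_id_pairs)


def most_unique_review(lines):
    step1 = run_step(lines, words_mapper, unique_word_reducer)
    step2 = run_step(step1, identity_mapper, count_reducer)
    step3 = run_step(step2, single_key_mapper, max_reducer)
    return step3[0] if step3 else None
```

`most_unique_review` returns a `(count, review_id)` tuple, or `None` when the input holds no reviews. Routing everything to one key in the last step is simple but puts all counts on a single reducer, which is acceptable here because step 2 has already reduced the data to one record per review.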
5 Write user_similarity.py
- Find pairs of users with similarity >= 0.5
- User Similarity: Jaccard similarity of businesses reviewed
- {BizA, BizB, BizC} ~ {BizF, BizB, BizG}: 1 shared business out of 5 distinct, so similarity = 1/5 = 0.2
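Jaccard similarity is the size of the intersection divided by the size of the union. The sketch below shows the calculation on in-memory sets. It is not a MapReduce job and not the expected user_similarity.py; `similar_users` and its `businesses_by_user` argument are hypothetical names, and comparing every pair this way only suits small inputs.

```python
from itertools import combinations


def jaccard(a, b):
    """Return |a & b| / |a | b|, or 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def similar_users(businesses_by_user, threshold=0.5):
    """Return (user1, user2, similarity) for pairs at or above threshold.

    businesses_by_user maps a user id to the set of business ids
    that user has reviewed (a hypothetical in-memory structure).
    """
    pairs = []
    for user1, user2 in combinations(sorted(businesses_by_user), 2):
        score = jaccard(businesses_by_user[user1], businesses_by_user[user2])
        if score >= threshold:
            pairs.append((user1, user2, score))
    return pairs
```

With the two sets from the slide, `jaccard({"BizA", "BizB", "BizC"}, {"BizF", "BizB", "BizG"})` falls below the 0.5 threshold, so that pair of users would not be reported.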