2013-02-15-mrjob

1 Lab: mrjob    slide

  • Understand review_word_count.py
  • Find review with most unique words
    • Fill in unique_review.py
  • Find similar users
    • Write user_similarity.py

1.1 mrjob    notes

  • Using the Yelp Academic Dataset
  • In lecture, we covered the steps for most unique words
  • Use Jaccard similarity for user_similarity

2 Data    slide

  • ischool: ~jblomo/yelp_academic_dataset.json
  • Copy the file or read it in place

2.1 Agreement    notes

3 Understand review_word_count.py    slide

$ python review_word_count.py yelp_academic_dataset.json

no configs found; falling back on auto-configuration
creating tmp directory /tmp/review_word_count.jim.20130215.071901.095847
reading from file
> /home/jim/src/datamining290/code/venv/bin/python review_word_count.py --step-num=0 --mapper /tmp/review_word_count.jim.20130215.071901.095847/input_part-00000
writing to /tmp/review_word_count.jim.20130215.071901.095847/step-0-mapper_part-00000
Counters from step 1:
  (no counters found)
...
Streaming final output from /tmp/review_word_count.jim.20130215.071901.095847/output
"4"     2
"5"     1
"50"    1
"6"     2
"7"     2
"70s"   1
"9"     2
"a"     46
"abbey" 4
"able"  1
"about" 4

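The output above pairs each lowercased word with its total count. The contents of review_word_count.py are not shown here, so the following is only a plain-Python sketch of the map and reduce logic that would produce output of this shape. The record field "text", the word regex, and the helper run() are assumptions for illustration; run() stands in for the grouping that the framework performs between the mapper and the reducer.

```python
import re
from collections import defaultdict

# Assumed tokenizer: runs of word characters and apostrophes.
WORD_RE = re.compile(r"[\w']+")


def mapper(review):
    """Emit (word, 1) for every word in one review's text."""
    for word in WORD_RE.findall(review["text"]):
        yield word.lower(), 1


def reducer(word, counts):
    """Sum the 1s emitted for a single word."""
    yield word, sum(counts)


def run(reviews):
    """Hypothetical local driver: group mapper output by key, then reduce."""
    grouped = defaultdict(list)
    for review in reviews:
        for key, value in mapper(review):
            grouped[key].append(value)
    result = {}
    for key in sorted(grouped):
        for word, total in reducer(key, grouped[key]):
            result[word] = total
    return result


sample = [
    {"text": "A great abbey"},
    {"text": "About a mile from the abbey"},
]
counts = run(sample)
print(counts["a"], counts["abbey"], counts["about"])  # 2 2 1
```

In an mrjob job the mapper and reducer would be methods on a job class and the grouping would be handled for you; only the two generator functions carry over.
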
4 Fill in unique_review.py    slide

  • Multi-step map reduce
  • Steps are explained in lecture
  • Skeleton in code
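
The lecture steps are not reproduced in these notes, so the following is one plausible three-step decomposition, written in plain Python rather than against the mrjob skeleton. It assumes "unique" means a word that appears in exactly one review, and that review records carry "review_id" and "text" fields. The function names and the shuffle() helper are hypothetical.

```python
import re
from collections import defaultdict

WORD_RE = re.compile(r"[\w']+")  # assumed tokenizer


def shuffle(pairs):
    """Stand-in for the framework's group-by-key between steps."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped.items()


def map_words(review):
    """Step 1 map: emit (word, review_id) for each word in a review."""
    for word in WORD_RE.findall(review["text"]):
        yield word.lower(), review["review_id"]


def reduce_unique(word, review_ids):
    """Step 1 reduce: keep words used by exactly one review."""
    ids = set(review_ids)
    if len(ids) == 1:
        yield ids.pop(), 1


def reduce_count(review_id, ones):
    """Step 2 reduce: count unique words per review, funnel to one key."""
    yield "MAX", (sum(ones), review_id)


def reduce_max(_, scored):
    """Step 3 reduce: pick the (count, review_id) pair with the largest count."""
    yield max(scored)


def most_unique_review(reviews):
    pairs = (pair for r in reviews for pair in map_words(r))
    step1 = [out for k, vs in shuffle(pairs) for out in reduce_unique(k, vs)]
    step2 = [out for k, vs in shuffle(step1) for out in reduce_count(k, vs)]
    step3 = [out for k, vs in shuffle(step2) for out in reduce_max(k, vs)]
    return step3[0] if step3 else None


reviews = [
    {"review_id": "r1", "text": "Great coffee and great pastries"},
    {"review_id": "r2", "text": "Coffee was cold and bitter"},
    {"review_id": "r3", "text": "Great service"},
]
print(most_unique_review(reviews))  # (3, 'r2')
```

Sending every per-review count to the single key "MAX" lets one reducer see all candidates, which is simple but puts the final step on one machine. Ties are broken by review_id here only because tuples compare element by element.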

5 Write user_similarity.py    slide

  • Find pairs of users with Jaccard similarity >= 0.5
  • User similarity: Jaccard similarity of the sets of businesses each user reviewed
  • {BizA, BizB, BizC} ~ {BizF, BizB, BizG}
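
Jaccard similarity is the size of the intersection of two sets divided by the size of their union. A minimal plain-Python sketch follows, using the two example sets above. The field names "user_id" and "business_id" and the function similar_users() are assumptions for illustration; the all-pairs loop is only suitable for small inputs and is not a MapReduce formulation.

```python
from collections import defaultdict
from itertools import combinations


def jaccard(a, b):
    """|A intersect B| / |A union B|, defined as 0.0 when both sets are empty."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def similar_users(reviews, threshold=0.5):
    """Yield ((user, user), score) for user pairs at or above the threshold."""
    businesses = defaultdict(set)
    for review in reviews:
        businesses[review["user_id"]].add(review["business_id"])
    for u, v in combinations(sorted(businesses), 2):
        score = jaccard(businesses[u], businesses[v])
        if score >= threshold:
            yield (u, v), score


# The example from the slide: one shared business out of five distinct ones.
print(jaccard({"BizA", "BizB", "BizC"}, {"BizF", "BizB", "BizG"}))  # 0.2

sample_reviews = [
    {"user_id": "u1", "business_id": "BizA"},
    {"user_id": "u1", "business_id": "BizB"},
    {"user_id": "u2", "business_id": "BizA"},
    {"user_id": "u2", "business_id": "BizB"},
    {"user_id": "u2", "business_id": "BizC"},
    {"user_id": "u3", "business_id": "BizF"},
]
for pair, score in similar_users(sample_reviews):
    print(pair, round(score, 3))  # ('u1', 'u2') 0.667
```

The slide's example pair scores 0.2, so it would not pass the 0.5 threshold.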

Date: 2013-02-15 13:52:26 PST

Author: Jim Blomo