HW12-mrjob

1 Homework   slide

  • Analyze site traffic
  • Transform web logs
  • Analyze user behavior
  • Analyze site stats

2 Data   slide

2.1 Format   notes

  • Part of the task is to understand the data format from the description on the website
  • This is actually more information than you are typically given on the job

3 mrjob   slide

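from mrjob.job import MRJob

class MRWordCounter(MRJob):
    # Called once per input line; the key is unused for plain text input.
    def mapper(self, key, line):
        for word in line.split():
            yield word, 1

    # Called once per distinct word, with all the counts emitted for it.
    def reducer(self, word, occurrences):
        yield word, sum(occurrences)

if __name__ == '__main__':
    MRWordCounter.run()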

More documentation: http://pythonhosted.org/mrjob/

3.1 Relation to HW   notes

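  • mapper is called once per input line with a key and a value; here the value is the line and the key is ignored
  • reducer is called once per key output by the mapper, with all of the values emitted for that key
  • split takes a string and splits it on whitespace, giving words
  • yield emits a <key,value> pair much like return, but can run more than once per call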
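
To make the notes above concrete, here is a minimal sketch that imitates the map, sort, and reduce phases in plain Python, without mrjob. The `run_local` helper is hypothetical and only stands in for the framework; mrjob does this work itself when a job runs.

```python
from itertools import groupby
from operator import itemgetter

def mapper(key, line):
    # yield may fire many times per call: one <word, 1> pair per word.
    for word in line.split():
        yield word, 1

def reducer(word, occurrences):
    # occurrences iterates over every value emitted for this word.
    yield word, sum(occurrences)

def run_local(lines):
    """Hypothetical stand-in for the framework: map, sort by key, reduce."""
    pairs = [pair for line in lines for pair in mapper(None, line)]
    pairs.sort(key=itemgetter(0))  # sorting puts equal keys next to each other
    results = []
    for word, group in groupby(pairs, key=itemgetter(0)):
        results.extend(reducer(word, (count for _, count in group)))
    return results

counts = run_local(["a b a", "b c"])
print(counts)  # [('a', 2), ('b', 2), ('c', 1)]
```

The sort between the two phases is what lets the reducer see all values for one key together.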

3.2 Output   slide

python data/top_pages.py < data/anonymous-msweb.data
no configs found; falling back on auto-configuration
creating tmp directory /tmp/top_pages.jim.20121116.052647.278066
...
reading from STDIN
writing to /tmp/top_pages.jim.20121116.052647.278066/step-0-mapper
Counters from step 1:
  (no counters found)
writing to /tmp/top_pages.jim.20121116.052647.278066/step-0-mapper-sorted
writing to /tmp/top_pages.jim.20121116.052647.278066/step-0-reducer
Counters from step 1:
  (no counters found)
Moving /tmp/top_pages.jim.20121116.052647.278066/step-0-reducer -> /tmp/top_pages.jim.20121116.052647.278066/output/part-00000
Streaming final output from /tmp/top_pages.jim.20121116.052647.278066/output
"1000"  912
"1001"  4451
"1002"  749
"1003"  2968
"1004"  8463
"1007"  865
...

3.3 Running   notes

  • Run with python, feeding the data file in on standard input
  • Outputs some debugging information while it is calculating
  • Finally, outputs the results

4 Pages with > 400 visits   slide

  • Find pages (aka Vroots) with more than 400 visits
  • Start off with a template

4.1 Demo   notes

  • csv_readline takes in a CSV line and returns a list of values
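
One way such a helper could be written, assuming it wraps Python's standard csv module, is sketched below. The helper in the homework template may differ, and the sample line is made up to show why a plain split on commas is not enough.

```python
import csv

def csv_readline(line):
    """Parse one CSV-formatted line into a list of string values."""
    # csv.reader expects an iterable of lines, so wrap the single line in a list.
    for row in csv.reader([line]):
        return row
    return []

# Made-up sample line: the quoted title contains a comma.
fields = csv_readline('A,1000,1,"Some Page, With Comma","/page"')
print(fields)  # ['A', '1000', '1', 'Some Page, With Comma', '/page']
```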

5 Transform Data   slide

  • MapReduce needs all the information for a record on one line, because the mapper sees one line at a time
  • This data format puts user information on a different line than the visits
  • Write a single-threaded program (not mrjob) to transform it
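
The general idea of the transformation can be sketched as below: remember the most recent user line and attach its ID to every visit line that follows. The line layout in the docstring and the output row format are assumptions for illustration only; check them against the data description on the website, and note that this is not the course template.

```python
import csv

def combine_user_visits(lines):
    """Yield one combined row per visit, carrying the latest user ID forward.

    Assumed layout (verify against the real data description):
      C,<user label>,<user id>   starts a block of visits for one user
      V,<page id>,<flag>         records one visit by that user
    """
    current_user = None
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        if row[0] == 'C':
            current_user = row[2]
        elif row[0] == 'V' and current_user is not None:
            # Hypothetical output format: visit fields, then user fields.
            yield ['V', row[1], 'C', current_user]

# Made-up sample input in the assumed layout.
sample = ['C,"10001",10001', 'V,1000,1', 'V,1001,1', 'C,"10002",10002', 'V,1001,1']
combined = list(combine_user_visits(sample))
for row in combined:
    print(','.join(row))
```

After this step every visit line names its user, so a mapper can process each line independently.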

5.1 Demo   notes

  • Fill in the blanks
  • What does the > character do? (It redirects output to a file)

6 Users > 20 visits   slide

  • Now that we have the new data file
  • Find users with > 20 visits
  • Must create this program from scratch
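
A count-then-filter job of this kind might look roughly like the sketch below, written as plain functions so it runs without mrjob. In an MRJob subclass these would be methods taking `self`. The input layout `V,<page id>,C,<user id>` and the `threshold` parameter are assumptions for illustration.

```python
def mapper(_, line):
    # Assumes a transformed line of the form 'V,<page id>,C,<user id>' (hypothetical).
    fields = line.strip().split(',')
    if len(fields) == 4 and fields[0] == 'V':
        yield fields[3], 1

def reducer(user_id, counts, threshold=20):
    # Emit a user only when the total number of visits exceeds the threshold.
    total = sum(counts)
    if total > threshold:
        yield user_id, total

print(list(mapper(None, 'V,1000,C,10001')))  # [('10001', 1)]
print(list(reducer('10001', [1] * 21)))      # [('10001', 21)]
print(list(reducer('10002', [1] * 20)))      # []
```

Filtering in the reducer works because only there is the full count for a key known.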

7 Most Common Title Words   slide

  • Vroots (pages) have titles
  • What are the 10 most common title words?
  • Must create this program from scratch
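
The counting logic itself can be sketched in a few lines of plain Python. The titles are made up, and lowercasing the words is an assumption about what should count as the same word. In a MapReduce job the final top-10 selection needs to see every word's count, which usually means a second step.

```python
from collections import Counter

def top_words(titles, n=10):
    """Return the n most common words across all titles as (word, count) pairs."""
    counts = Counter()
    for title in titles:
        # Lowercase so 'Windows' and 'windows' count as one word (an assumption).
        counts.update(title.lower().split())
    return counts.most_common(n)

# Made-up titles for illustration.
titles = ["Windows Support", "Windows NT", "Support for Windows"]
top_two = top_words(titles, n=2)
print(top_two)  # [('windows', 3), ('support', 2)]
```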

8 Extra Credit   slide

  • What titles were most browsed?
  • Which URLs co-occurred most frequently?
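
One possible reading of co-occurrence is "visited by the same user". Under that assumption, a minimal sketch counts every unordered pair of distinct URLs per user. The sample data and the `cooccurrence_counts` name are hypothetical.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(visits_by_user):
    """Count how many users visited each unordered pair of URLs."""
    pair_counts = Counter()
    for pages in visits_by_user.values():
        # sorted(set(...)) gives each pair one canonical order and drops repeats.
        for pair in combinations(sorted(set(pages)), 2):
            pair_counts[pair] += 1
    return pair_counts

# Made-up sample: user ID -> list of visited URLs.
sample = {'u1': ['/a', '/b', '/c'], 'u2': ['/a', '/b'], 'u3': ['/b']}
pairs = cooccurrence_counts(sample)
print(pairs.most_common(1))  # [(('/a', '/b'), 2)]
```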

9 Submit via git   slide

  • Update programs
  • Write new programs
  • Save output of running mrjobs into data/program_name.out
  • git add programs, output

10 Review   slide

  • Fill in top_pages.py
  • Fill in combine_user_visits.py
  • Write top_users.py
  • Write top_title_words.py
  • Run programs to create outputs: top_pages.out user-visits_msweb.data top_users.out top_title_words.out

11 Moved Due Date   slide

  • Dec 5
  • But start now!
  • This is probably the most coding-intensive assignment
  • Aim to get at least top_pages.py done by Thanksgiving

Author: Jim Blomo

Created: 2014-11-21 Fri 01:32

Emacs 23.4.1 (Org mode 8.2.4)