HW12-mrjob
1 Homework
- Analyze site traffic
- Transform web logs
- Analyze user behavior
- Analyze site stats
2 Data
- Anonymous web data from www.microsoft.com
- Contains information in CSV format
- Use the mrjob MapReduce framework to find answers
2.1 Format notes
- Part of the task is to understand the data format, given the website
- This is actually more information than you're typically given at a job
3 mrjob
from mrjob.job import MRJob


class MRWordCounter(MRJob):
    def mapper(self, key, line):
        for word in line.split():
            yield word, 1

    def reducer(self, word, occurrences):
        yield word, sum(occurrences)


if __name__ == '__main__':
    MRWordCounter.run()
More documentation: http://pythonhosted.org/mrjob/
3.1 Relation to HW notes
- mapper: takes keys and values
- reducer: takes the keys output by the mapper, and all relevant values
- split: takes a string and splits it on whitespace, giving words
- yield: essentially returns <key,value> pairs, but can be called more than once
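To make the roles of mapper, reducer, and yield concrete, here is a minimal sketch that imitates the map, sort, reduce cycle in plain Python, with no mrjob install needed. The `run_locally` driver is a hypothetical stand-in for the framework, not part of mrjob. The mapper and reducer have the same shape as the word counter above, minus `self`.

```python
from itertools import groupby
from operator import itemgetter


def mapper(_, line):
    # yield can fire many times per call: one <word, 1> pair per word
    for word in line.split():
        yield word, 1


def reducer(word, occurrences):
    # occurrences holds every value the mappers emitted for this one key
    yield word, sum(occurrences)


def run_locally(lines):
    """Hypothetical stand-in for the framework: map, sort by key, reduce."""
    pairs = [pair for line in lines for pair in mapper(None, line)]
    pairs.sort(key=itemgetter(0))  # sorting puts identical keys next to each other
    results = []
    for key, group in groupby(pairs, key=itemgetter(0)):
        results.extend(reducer(key, (value for _, value in group)))
    return results


print(run_locally(["the cat", "the hat"]))
# prints [('cat', 1), ('hat', 1), ('the', 2)]
```

The sort step is why the reducer sees all values for a key together. It plays the same role as the step-0-mapper-sorted file that mrjob writes.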
3.2 Output
python data/top_pages.py < data/anonymous-msweb.data
no configs found; falling back on auto-configuration
creating tmp directory /tmp/top_pages.jim.20121116.052647.278066
...
reading from STDIN
writing to /tmp/top_pages.jim.20121116.052647.278066/step-0-mapper
Counters from step 1: (no counters found)
writing to /tmp/top_pages.jim.20121116.052647.278066/step-0-mapper-sorted
writing to /tmp/top_pages.jim.20121116.052647.278066/step-0-reducer
Counters from step 1: (no counters found)
Moving /tmp/top_pages.jim.20121116.052647.278066/step-0-reducer -> /tmp/top_pages.jim.20121116.052647.278066/output/part-00000
Streaming final output from /tmp/top_pages.jim.20121116.052647.278066/output
"1000" 912
"1001" 4451
"1002" 749
"1003" 2968
"1004" 8463
"1007" 865
...
3.3 Running notes
- Run with python
- mrjob prints some debugging information while it is calculating
- Finally, it outputs the results
4 Pages with > 400 visits
- Find pages (aka Vroots) with more than 400 visits
- Start off with a template
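One way to approach the "more than 400 visits" filter is to count visits per page and only yield from the reducer when the total clears the threshold. The sketch below is not the course template. It assumes, without confirmation from the slides, that a visit row looks like `V,<page id>,1`. The `run_locally` driver and the `min_visits` parameter are hypothetical conveniences for trying it without mrjob.

```python
from collections import defaultdict

MIN_VISITS = 400  # threshold from the slide


def mapper(_, line):
    # Assumption: visit rows look like V,<page id>,1 and contain no quoted commas
    cells = line.strip().split(",")
    if cells[0] == "V" and len(cells) >= 2:
        yield cells[1], 1


def reducer(page_id, counts, min_visits=MIN_VISITS):
    total = sum(counts)
    if total > min_visits:  # strictly more than the threshold
        yield page_id, total


def run_locally(lines, min_visits=MIN_VISITS):
    """Hypothetical driver: group mapper output by key, then reduce each key."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper(None, line):
            grouped[key].append(value)
    results = []
    for key in sorted(grouped):
        results.extend(reducer(key, grouped[key], min_visits))
    return results


demo = ["V,1000,1"] * 3 + ["V,1001,1", 'C,"10001",10001']
print(run_locally(demo, min_visits=2))
```

In an actual MRJob subclass the reducer would take `(self, page_id, counts)` and compare against the constant directly.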
4.1 Demo notes
- csv_readline: takes in a CSV line and returns a list of values
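The template's own `csv_readline` is not shown here, so the following is only a guess at how such a helper could be written with the standard `csv` module. The sample row is made up for illustration. It shows why a plain `split(",")` is not enough once a quoted field contains a comma.

```python
import csv


def csv_readline(line):
    """Parse one line of CSV text into a list of string values."""
    # csv.reader expects an iterable of lines, so wrap the single line in a list
    for row in csv.reader([line]):
        return row
    return []  # nothing to parse


# Made-up row: the quoted title contains a comma
print(csv_readline('A,1000,1,"Support, Desktop","/support"'))
# prints ['A', '1000', '1', 'Support, Desktop', '/support']
```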
5 Transform Data
- MapReduce needs all information on one line
- This data format has user information on a different line than the visit
- Write a single-threaded program (not mrjob) to transform it
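A single-pass transform of this kind usually remembers the most recent user row and attaches it to every visit row that follows. The sketch below assumes, without confirmation from the slides, that user rows look like `C,"<user id>",<user id>` and visit rows like `V,<page id>,1`. The function name and the output layout (user ID appended as a last column) are hypothetical choices, not the assignment's required format.

```python
import csv
import io


def combine_user_visits(lines):
    """Yield each visit row with the most recent user ID appended."""
    current_user = None
    for cells in csv.reader(lines):
        if not cells:
            continue  # skip blank lines
        if cells[0] == "C":  # assumed: a row that introduces a user
            current_user = cells[1]
        elif cells[0] == "V" and current_user is not None:
            yield cells + [current_user]  # the visit now carries its user


sample = ['C,"10001",10001', "V,1000,1", "V,1001,1", 'C,"10002",10002', "V,1000,1"]
out = io.StringIO()
csv.writer(out, lineterminator="\n").writerows(combine_user_visits(sample))
print(out.getvalue(), end="")
```

A real script would read from `sys.stdin` and write to `sys.stdout` instead of the in-memory sample, so the result can be redirected into a new data file.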
5.1 Demo notes
- Fill in the blanks
- What does the > character do? (It redirects output)
6 Users > 20 visits
- Now that we have the new data file
- Find users with > 20 visits
- Must create this program from scratch
7 Most Common Title Words
- Vroots (pages) have titles
- What are the 10 most common title words?
- Must create this program from scratch
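A "top 10" question has two parts: count every word, then keep only the ten largest counts. The sketch below shows that shape in plain Python rather than as a finished mrjob program. It assumes, without confirmation from the slides, that page rows look like `A,<page id>,1,"<title>","<url>"`. Lowercasing the words is a design choice, not a requirement. The function names are hypothetical.

```python
import csv
import heapq
from collections import Counter


def title_words(lines):
    """Map step: emit <word, 1> for every word of every page title."""
    for cells in csv.reader(lines):
        # Assumption: page rows start with A and keep the title in column 4
        if len(cells) >= 4 and cells[0] == "A":
            for word in cells[3].lower().split():
                yield word, 1


def top_words(lines, n=10):
    """Reduce step: sum the counts, then keep the n most common words."""
    counts = Counter()
    for word, one in title_words(lines):
        counts[word] += one
    # Order by count descending, then alphabetically so ties are deterministic
    return heapq.nsmallest(n, counts.items(), key=lambda kv: (-kv[1], kv[0]))


sample = ['A,1,1,"Office Support","/a"', 'A,2,1,"Office Home","/b"', "V,1,1"]
print(top_words(sample, n=2))
# prints [('office', 2), ('home', 1)]
```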
8 Extra Credit
- What titles were most browsed?
- Which URLs co-occurred most frequently?
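For the co-occurrence question, one common approach is to take each user's set of visited pages and count every unordered pair once per user. The sketch below assumes the visits have already been grouped by user. The `visits_by_user` dictionary and the page IDs in it are made up for illustration.

```python
from collections import Counter
from itertools import combinations


def pair_counts(visits_by_user):
    """Count how many users visited both pages of each unordered pair."""
    counts = Counter()
    for pages in visits_by_user.values():
        # sorted(set(...)) removes repeat visits and gives each pair one canonical order
        for pair in combinations(sorted(set(pages)), 2):
            counts[pair] += 1
    return counts


# Made-up data: user -> list of visited page IDs
visits_by_user = {"u1": ["1000", "1001", "1002"], "u2": ["1001", "1000"]}
print(pair_counts(visits_by_user).most_common(1))
# prints [(('1000', '1001'), 2)]
```

In a MapReduce setting, the pair tuple would serve as the key that the mapper yields, with the reducer summing the ones.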
9 Submit via git
- Update programs
- Write new programs
- Save the output of running the mrjob programs into data/program_name.out
- git add the programs and output
10 Review
- Fill in top_pages.py
- Fill in combine_user_visits.py
- Write top_users.py
- Write top_title_words.py
- Run programs to create outputs: top_pages.out, user-visits_msweb.data, top_users.out, top_title_words.out
11 Moved Due Date
- Dec 5
- But start now!
- This is probably the most coding intensive assignment
- Aim to get at least top_pages.py done by Thanksgiving