MapReduce

Table of Contents

1 Snakes on a Hadoop   slide

2 Let's Jump In   slide

git clone https://github.com/jblomo/pycon-mrjob.git
cd pycon-mrjob
virtualenv venv
. venv/bin/activate
pip install -r requirements.txt

2.1 Versions   notes

  • All tests were done with Python 2.7
  • This is what is used on AWS EMR, so we're going to try to keep versions compatible.

3 Agenda   slide

  • What is MapReduce?
  • Writing MapReduce in mrjob
  • Exercise 1, user activity
  • Joins and Programmatic usage
  • Exercise 2, self-joins
  • Multi-step jobs & Protocols
  • Exercise 3, finding Yelp reviews with unique words
  • Exploring user-user similarity
  • Exercise 4, putting it all together
  • Deploying on AWS

3.1 Sections   notes

  • I will give some background, answer questions
  • I'll introduce a script
  • The exercise will have audience extending the script to get it working
  • We'll go over an example solution

4 Interest?   slide

  • What are you interested in?
  • What are you hoping to get out of this talk?
  • Introduce yourself to neighbor

4.1 Audience participation!   notes

  • This will help focus this talk
  • You'll run into bugs; your neighbor can be a good person to ask!
  • Name, years using Python, favorite Python function
  • I'll be helping out as much as I can, but help each other out first!

5 Run a Test   slide

echo hello | \
python \
venv/lib/python2.7/site-packages/mrjob/examples/mr_word_freq_count.py

6 Yelp has a problem   slide

  • 250+ GB of logs per day
  • Each GB takes 10 minutes to process
  • How long to handle a day's logs?

yelp-growth.png

6.1 Too long   notes

  • On a single machine 40+ hours!
  • If we really had only a single machine, we wouldn't be able to keep up!
  • A mistake can't be fixed within a day (especially important for billing)

7 Solution?   slide animate

  • Don't use one machine!
  • What are the new challenges?
  • Distributing data
  • Calculating overall statistics
  • Failures

7.1 New Challenges   notes

  • With many machines, how do they get access to the 250+ GB of logs?
  • How do they coordinate who gets which section of logs?
  • How do we calculate the average?
  • What happens when one of the boxes dies?
    • Detecting failure (timeout waiting for data? Out of band?)
    • Decide who takes over the data

8 Do It Yourself   slide two_col

  • There are many ways to deal with these challenges
  • Often, people would "roll" their own solutions depending on the problem
  • Google implemented a generic solution and shared the idea

mapreduce-paper.png

8.1 Dependencies   notes

  • Did you have a super-computer?
  • What programming language were you using?
  • Type of problem being solved (working on graphs, or web logs, …)

9 Big Idea   slide

  • Simplify, limit solution expression
  • Enable sophisticated implementation
  • Interface: Map() Reduce()
  • Implementation: Reliably run over 1000s of machines

9.1 Really Big Idea   notes

  • Limiting yourself to what can be expressed may seem like a loss
  • But it enables the implementation to handle the problems we talked about
  • And then can be used as understandable building blocks

10 MapReduce   slide

Map
Extract a property to summarize over
Reduce
Summarize all items with a particular property
  • Simple: Each operation stateless

10.1 Reading   notes

  • MapReduce's main benefits are for running over many machines, fault tolerance
  • But we'll just practice on one machine
  • Then see how to run in the cloud

10.2 Example   slide

  • URL Shortener
  • How many actions have we seen?
  • Redirects: 450, Saves: 40, Loads: 60

10.2.1 Details   notes

Redirects
How many times have we expanded a short link to a long one?
Saves
How many times have we saved a new URL?
Loads
How many times have we just loaded the front page?
First
So first step in MapReduce is what?

10.3 Map   slide

Input
Key, Value
Output
Keys, Values

10.4 Map Example   slide

Input Key
Log line number
Input Value
Log line text
Output Key
Action
Output Value
times this action has occurred on this line
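
A minimal sketch of this mapper in mrjob (the class name is illustrative, and the JSON log format assumes the logging style recommended later):

from mrjob.job import MRJob
import json

class MRActionCounter(MRJob):
    def mapper(self, key, line):
        # the input key (conceptually the line number) is unused;
        # assume each log line is a JSON object with an 'action' field
        log_data = json.loads(line)
        yield log_data['action'], 1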

10.4.1 Counts   notes

  • Log line number is not helpful in our specific case
  • Log line text: we hope it is machine readable so we can accurately extract the action
  • It has datetime, cookie, action, etc.
  • How many times has this action occurred? 1

10.5 Status?   slide

load            1
save            1
redirect        1
redirect        1
load            1
redirect        1
load            1
save            1
redirect        1

10.5.1 Middle Step   notes

  • From log lines, we've extracted the information out that we care about
  • The counts and the actions
  • Next step summarize
  • Next step after Map?

10.6 Reduce   slide

Input
Key, Values
Output
Keys, Values

10.6.1 Value*s*   notes

  • Note: The input is values! Plural
  • Because we get a key and all of its associated values
  • Remind me: what are we trying to get out of this computation?
  • So what do you think the output keys are?
  • Values?

10.7 Reduce Example   slide

Input Key
Action
Input Values
Counts: [1,1,1,1]
Output Key
Action
Output Value
Total Count
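
Continuing the MRActionCounter sketch from the Map example, the matching reducer is one line of summing:

    def reducer(self, action, counts):
        # counts is a stream of 1s, one per log line with this action
        yield action, sum(counts)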

10.7.1 Details   notes

  • Action is one of load, save, redirect
  • To get total count, sum all of the counts

10.8 Example Output   slide

Output Key
Action
Output Value
Total Count
"redirect"  4
"save"      2
"load"      3

11 Point?   slide

  • A lot of work for counting!
  • More complex calculations can be done this way, e.g. PageRank
  • Stateless constraint means it can be used across thousands of computers

11.1 Details   notes

  • By only looking at keys and values, can optimize a lot of backend work
  • Where to send the results?
  • What to do when a computer fails? (Just restart failed part)

11.2 Implementation   slide

load            1
save            1
redirect        1
redirect        1
load            1
redirect        1
load            1
save            1
redirect        1

11.3 Intermediate   notes

  • This was the situation after map
  • Keys all jumbled
  • What Hadoop does is sort them and distribute them to computers

11.4 "Shuffle"   slide

load            1
load            1
load            1
redirect        1
redirect        1
redirect        1
redirect        1
save            1
save            1

11.5 Distribute   notes

  • Now it is easy to distribute, and can handle all the load at once
  • When mrjob is running, you'll notice debugging output, including a "sort" step. This is why.

11.6 Inputs   slide

  • MapReduce distributes computing power by distributing input
  • Input is distributed by splitting on lines (records)
  • You cannot depend on lines being "together" in MapReduce

11.6.1 Splitting Files   notes

  • Imagine you have a lot of large log files, GBs each
  • You'd like to let different machines work on the same file
  • Split file down the middle, well, at least on a newline
  • Enable two separate machines to work on the parts
  • You don't know what line came before this one
  • You don't know if you will process the next line
  • Only view is this line
  • Real life is slightly more complicated, but mostly works around this

11.7 Incorrect Log Style   slide

  • URL Shortener logging
app.logger.info("Handling request for " + cookie)
...
# find redirect
...
app.logger.info("Redirecting to " + destination)
  • Ability to associate the redirect with the cookie?
    • e.g. which cookie had the most redirects?

11.7.1 No!   notes

  • Must log everything on same line
  • One machine could have the "Handling request…" line; another could have "Redirecting to…"
  • Collect it all, then log it

11.8 Correct Log Style   slide

  • Logging once
log_data['cookie'] = cookie
...
# find redirect
log_data['action'] = 'redirect'
app.logger.info(json.dumps(log_data))

12 Word Count   slide

  • Example: frequency of every word used in a text
  • Map: Extract the words from a line, yield the number of times we've seen it
  • Reduce: For each word, summarize those counts

12.1 Another example   notes

  • Let's take a look at another example
  • instead of counting actions, 1 per line, now we want to count words
  • potentially many per line
  • Same general idea though. Tunnel vision on one line (even one word!)

13 mrjob   slide

from mrjob.job import MRJob

class MRWordCounter(MRJob):
    def mapper(self, key, line):
        for word in line.split():
            yield word, 1

    def reducer(self, word, occurrences):
        yield word, sum(occurrences)

if __name__ == '__main__':
    MRWordCounter.run()

More documentation: http://pythonhosted.org/mrjob/

13.1 Relation to example   notes

  • mapper takes keys and values
  • reducer takes the keys output by the mapper, and all relevant values
  • split takes a string and splits on whitespace, giving words
  • yield essentially returns <key,value> pairs, but can be called more than once
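
  • For example, saving the class above as word_counter.py (a hypothetical filename) and running it locally might look like:

$ echo "the quick brown fox jumps over the lazy dog" | python word_counter.py
...
"brown" 1
"dog"   1
...
"the"   2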

14 Data   slide

14.1 Format   notes

  • Open up the file
  • The file tells us information about a "Vroot"
  • Vroot is just a fancy name for a page with a numeric ID
  • Each line is a different piece of information about the vroot:
  • A: attributes of a vroot (id, title, url)
  • C: customer information; the V records that follow are the visits that customer made to vroots
  • V: a visit to a vroot (id)
  • Before, we talked about actions like redirecting and saving. Now we have many more distinct actions: each vroot with an ID

14.2 Output   slide

  • vroot ID => number of visits
  • Group by the vroot ID
  • Summarize all of the visits

14.3 Summary   notes

  • And what does it look like to actually run this?

14.4 Output   slide

python code/top_pages.py msanon/anonymous-msweb.data.gz
no configs found; falling back on auto-configuration
creating tmp directory /tmp/top_pages.jim.20121116.052647.278066
...
reading from STDIN
writing to /tmp/top_pages.jim.20121116.052647.278066/step-0-mapper
Counters from step 1:
  (no counters found)
writing to /tmp/top_pages.jim.20121116.052647.278066/step-0-mapper-sorted
writing to /tmp/top_pages.jim.20121116.052647.278066/step-0-reducer
Counters from step 1:
  (no counters found)
Moving /tmp/top_pages.jim.20121116.052647.278066/step-0-reducer -> /tmp/top_pages.jim.20121116.052647.278066/output/part-00000
Streaming final output from /tmp/top_pages.jim.20121116.052647.278066/output
"1000"  912
"1001"  4451
"1002"  749
"1003"  2968
"1004"  8463
"1007"  865
...

14.5 Running   notes

  • Run with python
  • Output some debugging information while it is calculating
  • Finally, output results

15 Pages with > 400 visits   slide

  • Find pages (aka Vroots) with more than 400 visits
  • Start off with a template

15.1 Demo   notes

  • csv_readline takes in a CSV line and returns a list of values

16 BREAK   slide

  • Please complete exercise 1
  • Fill in code/top_pages.py
  • Take in web browsing data, output vroot to number of visits

17 Solution   slide

  • git commit -m "my solution" -a
  • git checkout ex1
  • yield the vroot ID and the number of times seen in that line (1)
  • sum the times seen, check threshold, yield pair
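
  • A sketch of how that solution might look (assuming the template's csv_readline helper; details in the repo may differ):

from mrjob.job import MRJob
import csv

def csv_readline(line):
    """Parse one CSV-formatted line into a list of values."""
    for row in csv.reader([line]):
        return row

class TopPages(MRJob):
    def mapper(self, key, line):
        cell = csv_readline(line)
        if cell[0] == 'V':       # a V record is one visit to a vroot
            yield cell[1], 1     # vroot ID, seen once on this line

    def reducer(self, vroot, visit_counts):
        total = sum(visit_counts)
        if total > 400:
            yield vroot, total

if __name__ == '__main__':
    TopPages.run()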

18 Discussion: User Visits   slide animate

  • Can we calculate the number of visits per user?
  • No! User information on different line
  • Cannot assume linear processing

19 Transform Data   slide

  • MapReduce needs all information on one line
  • This data format has user information on different line than visit
  • Write single threaded program (not mrjob) to transform it

19.1 Not covered   notes

  • This is a little out of the scope of this tutorial
  • Just know that not all types of data are amenable to MapReduce, and Hadoop/mrjob specifically

20 Most Common Title Words   slide

  • Vroots (pages) have titles
  • What are the titles of the 10 most browsed nodes?

21 Joins   slide

  • Sometimes need to join against two data sets
  • In our case, it is the same file, different record types
  • mrjob can pass serializable objects between Map and Reduce steps
  • You can "tag" values with identifiers, eg. title or visit
A,1012,1,"Outlook Development","/outlookdev"
...
V,1012,1
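
  • A sketch of tagging for these two record types (method bodies are illustrative; exercise 2 has the real template):

def mapper(self, key, line):
    cell = csv_readline(line)
    if cell[0] == 'A':
        yield cell[1], ['A', cell[3]]   # tag attribute records with the title
    elif cell[0] == 'V':
        yield cell[1], ['V', 1]         # tag visit records with a count

def reducer(self, vroot, tagged_values):
    title, visits = None, 0
    for tag, value in tagged_values:
        if tag == 'A':
            title = value
        else:
            visits += value
    yield title, visits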

22 Running Programatically   slide

  • We've been running on the command line
  • How do we process the output of an MRJob inline?
  • Testing, inserting results into a database, doing further processing on the full result

23 MRJob Class   slide

mr_job = MRJobSubclass(args=[...])
with mr_job.make_runner() as runner:
    runner.run()
    for line in runner.stream_output():
        key, value = mr_job.parse_output_line(line)

23.1 Example   notes

  • code/top_titles.py

24 BREAK   slide

  • Please complete exercise 2
  • What are the titles of the 10 most browsed nodes?
  • Fill in code/count_titles.py
  • Examine and run with code/top_titles.py

25 Solution   slide

  • git checkout ex2
  • yield a tuple tagged with either 'A' or 'V'
  • When iterating through values, extract the data from the tagged tuple

25.1 Other possibilities   notes

  • Can tag with a dictionary instead

26 Yelp Data   slide

  • Subset of data available in yelp/selected_reviews.json.gz
{"votes": {"funny": 3, "useful": 3, "cool": 2},
 "user_id": "dYtkphUrU7S2_bjif6k2uA",
 "review_id": "gjtWdiEMMfoOTCfdd3hPmA",
 "stars": 4,
 "date": "2009-04-24",
 "text": "Man, if these guys were trying to replicate a gringo bar in Mexico...",
 "type": "review",
 "business_id": "RqbSeoeqXTwts5pfhw7nJg"}

26.1 More Data   notes

27 Multi-Step   slide

  • Not all computations can be done in a single MapReduce step
  • Map Input: <key, value>
  • Reducer Output: <key, value>
  • Compose MapReduce steps!

27.1 Output as Input   notes

  • The output of one MapReduce job can be used as the input to another

27.2 Examples   slide

  • PageRank: Multiple steps till solution converges
  • Multi-level summaries

27.3 PageRank   notes

  • PageRank is an algorithm for calculating the importance of a page
  • But it depends on the importance of every page pointing to it!
  • So iteratively calculate the importance of all pages
  • Multi-level summary example: find average presidential donations by candidate, then normalize the averages

28 Unique Review   slide animate

  • Count unique words per review
  • Map Input: <line number, review dictionary>
  • Map Output: <word, review-id>
  • Reducer Input: <word, [review-ids]>
  • Reducer Output: <review-id, 1> if the word is unique

28.1 Questions   notes

  • For our purposes, what is always the mapper input?
  • What feature do we want to calculate first?
  • Given this mapper output, what must the reducer input be?
  • What property about a review are we interested in?

28.2 Step 2: Count Unique Words   slide animate

  • Map Input: <review-id, 1>
  • Map Output: <review-id, 1>
  • Reducer Input: <review-id, [1,1,…]>
  • Reducer Output: <review-id, sum>

28.3 Questions   notes

  • Given the reducer output, what must the mapper input be (for chained MapReduce steps)
  • What do we want to group by?
  • Given this mapper output, what must the reducer input be?
  • What are we calculating?

29 INPUT PROTOCOL   slide

  • Previous examples manually parsed each line
  • mrjob provides "protocols" to automatically parse lines and recover from errors
  • Specify the class desired to parse the lines
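
  • Specifying one is a single class attribute; a minimal sketch (class name and fields are illustrative):

from mrjob.job import MRJob
from mrjob.protocol import JSONValueProtocol

class MRReviewWords(MRJob):
    # each input line is parsed as JSON before reaching the mapper
    INPUT_PROTOCOL = JSONValueProtocol

    def mapper(self, _, review):
        # review is a Python dict, not a raw string
        for word in review['text'].split():
            yield word, review['review_id']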

29.1 Examples   notes

  • Open up code/unique_review.py
  • We're using a JSON object per line. Other options include JSON key-values, or writing your own
  • Instead of raw text, the record will now be a Python dict

30 Protocols   slide

class JSONValueProtocol(object):
    """Encode ``value`` as a JSON and discard ``key``
    (``key`` is read in as ``None``).
    """
    def read(self, line):
        return (None, json.loads(line))

    def write(self, key, value):
        return json.dumps(value)

30.1 Writing your own   notes

  • Protocols are straightforward to write
  • At their heart, they contain a read and a write method
  • For example, we've implemented a Protocol for MySQL INSERT statements, so we can easily process data created by mysqldump
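
  • As an illustration (not mrjob's code, and not the MySQL protocol itself), a custom protocol for one CSV row per line might look like:

import csv
import StringIO  # Python 2, matching the tutorial environment

class CSVValueProtocol(object):
    """Decode each line as a CSV row and discard the key."""
    def read(self, line):
        return None, next(csv.reader([line]))

    def write(self, key, value):
        out = StringIO.StringIO()
        csv.writer(out).writerow(value)
        return out.getvalue().rstrip('\r\n')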

31 Protocols and InputFormats   slide

  • mrjob operates on lines of tab delimited text
  • InputFormat translates data stored on HDFS to Text
  • Protocols translate a line into a Python variable

31.1 Limitations   notes

  • Line orientation is a limitation of Hadoop Streaming
  • It sends data through stdin/stdout. Sometimes people ask if they can process an XML file with arbitrary line breaks
  • Not what mrjob is good at. Hadoop may break at any point without extra configuration
  • Some of you may be familiar with Hadoop's InputFormat or OutputFormat. They serve a slightly different purpose than Protocols
  • InputFormats can be used to extract data which is not text on disk, for example SequenceFiles, or from data which is not simply stored in a directory, like providing a Manifest file.

32 BREAK   slide

  • Please complete exercise 3
  • Fill in code/unique_review.py
  • Output a review ID with how many unique words it used
  • Run with data in yelp/selected_reviews.json

33 Solution   slide

  • Extract words from the review text field
  • Group them by word, keep those that are unique
  • Group them by review, return sum
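
  • One possible shape for the whole job, assuming mrjob's MRStep for chaining (older mrjob versions spell the steps with self.mr instead):

from mrjob.job import MRJob
from mrjob.step import MRStep
from mrjob.protocol import JSONValueProtocol

class MRUniqueWords(MRJob):
    INPUT_PROTOCOL = JSONValueProtocol

    def steps(self):
        return [
            MRStep(mapper=self.word_to_review, reducer=self.keep_unique),
            MRStep(reducer=self.count_uniques),  # identity mapper in between
        ]

    def word_to_review(self, _, review):
        for word in review['text'].split():
            yield word, review['review_id']

    def keep_unique(self, word, review_ids):
        ids = set(review_ids)
        if len(ids) == 1:
            # only one review in the whole data set used this word
            yield ids.pop(), 1

    def count_uniques(self, review_id, ones):
        yield review_id, sum(ones)

if __name__ == '__main__':
    MRUniqueWords.run()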

34 Quantifying Intuitions   slide

  • Data Scientists translate ideas into math
  • Often want to suggest "similar" items or users
  • But what does "similar" mean?

34.1 Skills   notes

  • One of the core skills of a data scientist is translating an intuition into a quantifiable entity
  • An example is recommendation systems. If you want to find people to follow, or friends to recommend, you often want to find "similar" people to yourself
  • But what does similar mean? (Ask)
  • In order to recommend one product over another, or sort in any useful way, you need to associate a number with that similarity

35 Simple: Two Points   slide animate

  • Sometimes can translate entities to points
  • What is a similarity metric between these two?
  • Euclidean distance
  • sqrt((x1-x2)**2 + (y1-y2)**2)

35.1 Start   notes

  • For example intracity geographic distance
  • Or house price and square feet
  • Called "vector space" because a point also describes a vector, which you

can get interesting information about (magnitude, angles, etc)

  • Simple starting point: Euclidean distance
  • Has many downsides, but is an example of translating a concept of

similarity to a number

  • Interestingly the notions of distance and similarity are two sides of the

same coin.

  • Also called the L2 norm because we're taking the second root. We could

cube and take the cube root… leads to different semantics

  • How to measure if points don't map so easily to this "vector space"?

36 Jaccard Coefficient   slide

  • Commonly used for calculating set similarity
  • |intersection| / |union|
  • "Jim likes pizza" | "Shreyas likes pizza"

36.1 Jaccard   notes

  1. Break up into a set
  2. calculate # in intersection
  3. calculate # in union
  4. divide
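
In Python the whole recipe is a few lines; a quick sketch:

def jaccard(a, b):
    """Set similarity: |intersection| / |union|."""
    a, b = set(a), set(b)
    return len(a & b) / float(len(a | b))

# shares {"likes", "pizza"} out of 4 distinct words -> 0.5
jaccard("Jim likes pizza".split(), "Shreyas likes pizza".split())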

37 Jaccard of Users   slide

  • We take the set of businesses reviewed by two users
  • Value if users have no businesses in common?
  • All businesses?

37.1 Answers   notes

  • None = 0
  • All = 1

38 Write user_similarity.py   slide

  • Find users >= 0.5 similarity
  • User Similarity: Jaccard similarity of businesses reviewed
  • {BizA, BizB, BizC} ~ {BizF, BizB, BizG}

38.1 Input   notes

  • Takes as input the list of reviews that we have
  • You'll be more on your own for this one, very limited skeleton

39 BREAK   slide

  • Please complete exercise 4
  • Fill in code/user_similarity.py
  • Run with data in yelp/selected_reviews.json
  • Hint: Users with > 0 similarity will share at least 1 business

40 Solution   slide

  • Map user to list of businesses reviewed
  • Map business to list of user descriptions
  • Find users with similarity
  • De-duplicate
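
  • A hedged sketch of that plan as a three-step job (names are invented and this is one possible shape, not the official solution; the final reducer relies on duplicate pairs carrying identical business lists):

from mrjob.job import MRJob
from mrjob.step import MRStep
from mrjob.protocol import JSONValueProtocol

class MRUserSimilarity(MRJob):
    INPUT_PROTOCOL = JSONValueProtocol

    def steps(self):
        return [
            MRStep(mapper=self.review_to_user, reducer=self.biz_by_user),
            MRStep(mapper=self.user_by_biz, reducer=self.pair_users),
            MRStep(reducer=self.jaccard),
        ]

    def review_to_user(self, _, review):
        yield review['user_id'], review['business_id']

    def biz_by_user(self, user_id, biz_ids):
        # the set of businesses this user reviewed
        yield user_id, sorted(set(biz_ids))

    def user_by_biz(self, user_id, biz_ids):
        # invert: users sharing a business meet at that business's key
        for biz_id in biz_ids:
            yield biz_id, [user_id, biz_ids]

    def pair_users(self, biz_id, user_descs):
        users = list(user_descs)
        for i, (u1, b1) in enumerate(users):
            for u2, b2 in users[i + 1:]:
                # sorted pair, so duplicates group together in the next step
                yield sorted([u1, u2]), [b1, b2]

    def jaccard(self, pair, biz_lists):
        # duplicates (one per shared business) carry the same lists
        b1, b2 = list(biz_lists)[0]
        s1, s2 = set(b1), set(b2)
        similarity = len(s1 & s2) / float(len(s1 | s2))
        if similarity >= 0.5:
            yield pair, similarity

if __name__ == '__main__':
    MRUserSimilarity.run()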

40.1 Efficiency   notes

  • We do duplicate work, but the trade-off is parallelism in the best case (not many users are overlapping with each other for dozens of reviews)

41 EMR   slide

  • We've been using local mode
  • Effective for debugging, small amounts of data
  • What about actually processing Big Data?

42 Elastic MapReduce   slide

  • Hadoop as a service by Amazon
  • Installs and sets up Hadoop cluster
  • JobFlow: instance of this cluster
  • Jobs: one command with a series of MapReduce steps

42.1 Nomenclature   notes

  • API to bring up a cluster of N machines
  • Amazon provides tech for registering them all together, running Bootstrap commands, terminating them, etc

  • This cluster is called a JobFlow
  • We'll discuss what it means to have one JobFlow with multiple Jobs in a bit

43 mrjob provides   slide

  • Encode standard settings as config file or arguments
  • Uploads and runs bootstrap commands w/ local integration
  • Transparent inter-job bookkeeping
  • Tracking, error collection, job sharing

43.1 Benefits   notes

  • A config file is easy to install across developers. Yelp has 3 main ones: standard, memory-intensive, and cpu-intensive.
  • Otherwise all of your developers must learn the different options for the AWS API
  • Often you'll need to make files available to EMR via S3, for example custom packages you'd like to install. EMR cannot reach back out to your localhost. mrjob will automate the uploading and management of these files.
  • mrjob "tags" JobFlows with information about the user running them, and has tools for aggregating those metrics.
  • It uses the tags to find compatible JobFlows, in the case where it will share an existing one.

44 Settings   slide

  • Instance types
  • AWS keys
  • Location for logs
  • SSH key
  • Bootstrap behavior
  • Setup commands per job
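
  • These typically live in mrjob.conf; a minimal illustrative sketch (option names follow mrjob's EMR documentation of this era; every value is a placeholder):

runners:
  emr:
    aws_access_key_id: YOUR_ACCESS_KEY
    aws_secret_access_key: YOUR_SECRET_KEY
    ec2_key_pair: your-ssh-key
    ec2_key_pair_file: ~/.ssh/your-ssh-key.pem
    ec2_instance_type: m1.large
    num_ec2_instances: 4
    s3_log_uri: s3://your-bucket/logs/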

44.1 Info   notes

Instance types
How beefy a machine do you want to run? Is your load memory or CPU bound? Yelp uses a few standard options, starting with m1.large
AWS keys
These are used to communicate with AWS to start your instances
Location for logs
Where in S3 do the results of the run go? EMR logs the results of bootstrap actions; Hadoop logs the results of running the query.
SSH key
Once the instance is started, mrjob can connect to it via SSH to stream logs, and set up a tunnel for HTTP traffic to use the Hadoop interface
Bootstrap behavior
Often you'll need to install libraries, or upload your own repository. Since your code is running on the entire cluster at once, this install needs to happen on all machines

45 Bootstrap vs Setup   slide

  • Bootstrap runs at VM creation
  • Setup runs at the start of each job
  • Bootstrap for run-once non-code config

45.1 Setup vs Bootstrap   notes

  • When repeatedly changing and testing your code, you'll want a new version deployed to the machine each time. That is what setup commands are for: parts of your application that change

46 Book Keeping   slide

  • --output-dir s3://output-bucket/path/
  • --no-output
  • intermediate results stored on HDFS

47 Tracking   slide two_col

  • mrjob opens an SSH tunnel to the master node for Hadoop stats
  • On job failure, will download logs and use heuristics to find likely error

HadoopUI.png

48 EMR Pricing   slide

  • Job flows are billed hourly, rounding up

jobflow-no-pooling.png

48.1 Typical usage   notes

  • Spin up a job flow, when finished, kill it
  • Remaining time at the end of the last hour is wasted
  • Billed for it, but not used

48.2 Job Flow Pooling   slide

  • Don't end job flow after job, share remaining time

jobflow-pooling.png

  • New jobs can wait N minutes for an idle flow

48.2.1 Pooling   notes

  • Billed on the hour, so don't give up early!
  • Only shut down idle job flows at the end of the hour
  • If another job comes up, reuse idle flow
  • Also makes jobs faster to spin up, a bonus for developers

49 How To   slide

python code/unique_review.py -r emr -c mrjob.conf s3://i290-jblomo/data/selected_reviews.json.gz
  • Simply specify -r emr

50 mrjob Workflow   slide

  • Develop locally on subset of data
  • Run on EMR on all data located in S3 bucket
  • Either save output in S3, or stream locally

51 Thank You!   slide

  • Jim Blomo, Engineering Manager Yelp
  • @jimblomo

Author: Jim Blomo

Created: 2014-04-08 Tue 20:08
