2013-02-15-MapReduce

1 MapReduce
- 1.1 Spoilers
2 Yelp has a problem
- 2.1 Too long
3 Solution?
- 3.1 New Challenges
4 Do It Yourself
- 4.1 Dependencies
5 Big Idea
- 5.1 Really Big Idea
6 MapReduce
7 Point?
8 Word Count
- 8.1 Classic
9 Steps
- 9.1 Hints
- 9.2 Examples
10 Multi-Step
11 Unique Review

1 MapReduce slide

1.1 Spoilers notes

Don't look ahead in the slides
If you know MapReduce, try to let others answer and genuinely think about how you would solve the problem.

2 Yelp has a problem slide

250+ GB of logs per day
Each GB takes 10 minutes to process
How long to handle a day's logs?

img/yelp-growth.png

2.1 Too long notes

On a single machine 40+ hours!
If we really had only a single machine, we wouldn't be able to keep up!
Mistake can't be fixed in a day (billing especially important)

3 Solution? slide animate

Don't use one machine!
What are the new challenges?
Distributing data
Calculating overall statistics
Failures

3.1 New Challenges notes

With many machines, how to they get access to the 100 GB of logs?
How do they coordinate who gets which section of logs?
How do we calculate the average?
What happens when one of the boxes dies?
- Detecting failure (timeout waiting for data? Out of band?)
- Decide who takes over the data

4 Do It Yourself slide two_col

There are many ways to deal with these challenges
Often, people would "roll" their own solutions depending on the problem
Google implemented a generic solution, shared idea

img/mapreduce-paper.png

4.1 Dependencies notes

Did you have a super-computer?
What programming language were you using?
Type of problem being solved (working on graphs, or web logs, …)

5 Big Idea slide

Simplify, limit solution expression
Enable sophisticated implementation

Interface: Map() Reduce()
Implementation: Reliably run over 1000s of machines

5.1 Really Big Idea notes

Limiting yourself to what can be expressed may seem like a loss
But it enables the implementation to handle the problems we talked about
And then can be used as understandable building blocks

6 MapReduce slide

Map: Extract a property to summarize over
Reduce: Summarize all items with a particular property

Simple: Each operation stateless

6.1 Reading notes

Reading this week includes a video explaining MapReduce much more generally
This lecture will focus on it from a practical standpoint for homework
MapReduce's main benefits are for running over many machines, fault tolerance
But we'll just practice on one machine

6.2 Example slide

Web application logs
How many actions have we seen?
- Business views
- User profile views
- Searches

6.2.1 Details notes

Business Views: Triple Rock, Bear Raman
User profile: jimblomo.yelp.com
Searches: query, location

6.3 Logs slide

{'page_type': 'search',
 'user': 'jim', 'query': ...}

{'page_type': 'biz_view',
 'user': 'shreyas', 'biz_id': 55}

{'page_type': 'user_profile',
 'user': null, 'profile_id: 123}

...

6.3.1 Logs notes

JSON logs, various types of information
entire record on one line (wrapped for slides)

6.4 Map slide

Input: Key, Value
Output: Keys, Values

6.5 Map Example slide

Input Key: Log line number
Input Value: Log line text
Output Key: Action
Output Value: times this action has occurred on this line

6.5.1 Counts notes

Log line number is not helpful in our specific case
Log line text: we hope it is machine readable so we can accurately extract the action
It has datetime, cookie, action, etc.
How many times has this action occurred? 1
Tunnel vision: all we care about is this line

6.6 Actions? slide

search       1 
biz_view     1 
user_profile 1 
search       1 
biz_view     1 
search       1 
biz_view     1 
user_profile 1 
search       1

6.6.1 Middle Step notes

From log lines, we've extracted the information out that we care about
The counts and the actions
Next step summarize
Next step after Map?

6.7 Reduce slide

Input: Key, Values
Output: Keys, Values

6.7.1 Values notes

Note: The input is values! Plural
Because we get a key and all of its associated values
Remind me: what are we trying to get out of this computation?
So what do you think the output keys are?
Values?

6.8 Reduce Example slide

Input Key: Action
Input Values: Counts: [1,1,1,1]
Output Key: Action
Output Value: Total Count

6.8.1 Details notes

Action is one of search biz_view profile_view
To get total count, sum all of the counts

6.9 Example Output slide

Output Key: Action
Output Value: Total Count

"search"       4
"user_profile" 2
"biz_view"     3

7 Point? slide

A lot of work for counting!
More complex calculations can be done this way, eg. PageRank
Stateless constraint means it can be used across thousands of computers

7.1 Details notes

By only looking at keys and values, can optimize a lot of back-end work
Where to send the results?
What to do when a computer fails? (Just restart failed part)

7.2 Implementation slide

biz_view     1 
user_profile 1 
search       1 
search       1 
biz_view     1 
search       1 
biz_view     1 
user_profile 1 
search       1

7.3 Intermediate notes

This was the situation after map
Keys all jumbled
What Hadoop does is sort them and distribute them to computers

7.4 "Shuffle" slide

biz_view     1 
biz_view     1 
biz_view     1 
search       1 
search       1 
search       1 
search       1 
user_profile 1 
user_profile 1

7.5 Distribute notes

Now it is easy to distribute, and can handle all the biz_view at once

7.6 Inputs slide

MapReduce distributes computing power by distributing input
Input is distributed by splitting on lines (records)
You cannot depend on lines being "together" in MapReduce

7.6.1 Splitting Files notes

Image you have a lot of large log files, GB each
You'd like to let different machines work on the same file
Split file down the middle, well, at least on a newline
Enable two separate machines to work on the parts
You don't know what line came before this one
You don't know if you will process the next line
Only view is this line
Real life slightly more complicated, but mostly hacks around this

8 Word Count slide

{"text": "Greatest pizza ever", "stars": 2, "user": ...}

{"text": "good pizza selection", "stars": 5, "user": ...}

Total uses of a word in across all reviews

8.1 Classic notes

This is the traditional MapReduce example, so let's solve it
No skipping ahead

9 Steps slide animate

Map
Extract text
Count words in that review
Key: word , Value: count
Reduce
Key: word , Values: all counts
sum(values)

9.1 Hints notes

What's the first step (of MapReduce)
What part of the record are we interested in?
What do we want with those words?
Mapper: Key Value? What are we grouping by?
Next step (of MapReduce)
What are the reducer inputs
with all of these counts, how do we summarize

9.2 Examples slide animate

"Greatest pizza ever"
Counts
- Greatest: 1
- pizza: 1
- ever: 1
Reducer, Key: pizza
- Values: [1, 1]
- Output: ["pizza", 2]

10 Multi-Step slide

Not all computations can be done in a single MapReduce step
Map Input: <key, value>
Reducer Output: <key, value>
Compose MapReduce steps!

10.1 Output as Input notes

The output of one MapReduce job can be used as the input to another

10.2 Examples slide

PageRank: Multiple steps till solution converges
Multi-level summaries

10.3 PageRank notes

PageRank is an algorithm for calculating the important of a page
But it depends on the importance of every page pointing to it!
So iteratively calculate the important of all pages
Find average presidential donations by candidate, then normalize averages

11 Unique Review slide animate

Review ID with the most unique words
Map Input: <line number, text>
Map Output: <word, review_id>
Reducer Input: <word, [review_ids]>
Reducer Output: <review_id, 1> if the word is unique

11.1 Questions notes

For our purposes, what is always the mapper input?
What feature do we want to calculate first?
Given this mapper output, what must the reducer input be?
What property about a review are we interested in?

11.2 Step 2: Count Unique Words slide animate

Map Input: <review_id, 1>
Map Output: <review_id, 1>
Reducer Input: <review_id, [1,1,…]>
Reducer Output: <review_id, sum>

11.3 Questions notes

Given the reducer output, what must the mapper input be (for chained MapReduce steps)
What do we want to group by?
Given this mapper output, what must the reducer input be?
What are we calculating?

11.4 Step 3: Max slide animate

Map Input: <review_id, sum>
Map Output: <"MAX", [sum, review_id]>
Reducer Input: <"MAX", [[sum, review_id],…]>
Reducer Output: <review_id, sum> of the max(sum)

11.5 Questions notes

Given the reducer output, what must the mapper input be (for chained MapReduce steps)
We're calculating a statistic over what portion of the data set?
What stat are we calculating?

2013-02-15-MapReduce

Table of Contents

1 MapReduce slide

1.1 Spoilers notes

2 Yelp has a problem slide

2.1 Too long notes

3 Solution? slide animate

3.1 New Challenges notes

4 Do It Yourself slide two_col

4.1 Dependencies notes

5 Big Idea slide

5.1 Really Big Idea notes

6 MapReduce slide

6.1 Reading notes

6.2 Example slide

6.2.1 Details notes

6.3 Logs slide

6.3.1 Logs notes

6.4 Map slide

6.5 Map Example slide

6.5.1 Counts notes

6.6 Actions? slide

6.6.1 Middle Step notes

6.7 Reduce slide

6.7.1 Values notes

6.8 Reduce Example slide

6.8.1 Details notes

6.9 Example Output slide

7 Point? slide

7.1 Details notes

7.2 Implementation slide

7.3 Intermediate notes

7.4 "Shuffle" slide

7.5 Distribute notes

7.6 Inputs slide

7.6.1 Splitting Files notes

8 Word Count slide

8.1 Classic notes

9 Steps slide animate

9.1 Hints notes

9.2 Examples slide animate

10 Multi-Step slide

10.1 Output as Input notes

10.2 Examples slide

10.3 PageRank notes

11 Unique Review slide animate

11.1 Questions notes

11.2 Step 2: Count Unique Words slide animate

11.3 Questions notes

11.4 Step 3: Max slide animate

11.5 Questions notes