2013-02-22-Bayes


1 Classification: Bayes    slide

2 Confusion Matrix    slide

  • What are the ways that classification can be wrong?
                   | Predict: Positive | Predict: Negative
Actual: Positive   | True Positive     | False Negative
Actual: Negative   | False Positive    | True Negative
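
A minimal sketch of tallying these four cells from parallel lists of actual and predicted labels (the labels below are made up):

  from collections import Counter

  def confusion_counts(actual, predicted):
      """Tally the four confusion-matrix cells from parallel label lists."""
      counts = Counter()
      for a, p in zip(actual, predicted):
          if a and p:
              counts["true positive"] += 1
          elif a and not p:
              counts["false negative"] += 1
          elif p:
              counts["false positive"] += 1
          else:
              counts["true negative"] += 1
      return counts

  # Made-up labels: True is the positive class
  print(confusion_counts([True, True, False, False],
                         [True, False, True, False]))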

2.1 Obtain Data    notes

  • How do we obtain this data?

3 Testing Data    slide two_col

  • Data used to test a learned model
  • Test data was not used to learn
  • Where does test data come from?

img/stork.jpg

4 Training Data    slide

  • Set aside a portion of training data to test with
  • Test data:

img/k-fold1.png
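
A rough sketch of setting aside a test portion, here assuming a simple 80/20 shuffle-and-split:

  import random

  def train_test_split(examples, test_fraction=0.2, seed=0):
      """Shuffle the data, then hold out a fraction for testing."""
      rng = random.Random(seed)
      shuffled = examples[:]
      rng.shuffle(shuffled)
      n_test = int(len(shuffled) * test_fraction)
      return shuffled[n_test:], shuffled[:n_test]  # (training, testing)

  training, testing = train_test_split(list(range(100)))
  print(len(training), len(testing))  # 80 20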

4.1 Set Aside Testing    slide

img/k-fold2.png

Testing Data | Training Data

4.2 Colors    notes

  • Red: Testing
  • Green: Training

4.3 Cross Validation    slide

img/k-fold3.png

Train and test model with different subsets of data

4.4 Testing the model    notes

  • This is used to test the model
  • How well does it perform with a variety of inputs?
  • Is it robust against outliers?

4.5 K-Fold Validation    slide

img/k-fold4.png

Test against K sections of the data
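
A minimal sketch of splitting the data into K folds by index (K = 5 is an arbitrary choice):

  def k_fold_indices(n_examples, k=5):
      """Yield (train_indices, test_indices) once per fold."""
      indices = list(range(n_examples))
      fold_size = n_examples // k
      for fold in range(k):
          test = indices[fold * fold_size:(fold + 1) * fold_size]
          train = indices[:fold * fold_size] + indices[(fold + 1) * fold_size:]
          yield train, test

  for train, test in k_fold_indices(10, k=5):
      print(test)  # each example appears in exactly one test fold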

4.6 Statistical Significance    notes

  • Similar to the concept in stats: the more distinct samples you have, the better you know your data

4.7 K-Fold Validation    slide

img/k-fold5.png

5 Bayes Theorem    slide

img/bayes.png

Can calculate a posterior given priors
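
Written out, the theorem in the image is:

  P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)}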

5.1 Read    notes

  • Probability of A given B equals the probability of B given A, times the probability of A, divided by the probability of B
  • Importance is that we can figure out what future probabilities are based on what we've already seen

6 Spam    slide

img/bayes-spam.png

Find the probability of spam given it contains a particular word
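
A sketch of applying the theorem with made-up counts from a labeled corpus:

  # Hypothetical counts from labeled training messages
  n_spam, n_ham = 400, 600              # messages of each class
  word_in_spam, word_in_ham = 120, 6    # messages containing the word

  p_spam = n_spam / (n_spam + n_ham)                        # prior P(spam)
  p_word_given_spam = word_in_spam / n_spam                 # P(word | spam)
  p_word = (word_in_spam + word_in_ham) / (n_spam + n_ham)  # P(word)

  p_spam_given_word = p_word_given_spam * p_spam / p_word   # Bayes theorem
  print(round(p_spam_given_word, 3))  # 0.952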

6.1 Words    notes

  • What words would you associate with spam?
  • Are these the same across all people?
  • Why might you want to train a classifier per person?

7 Multiple Words    slide animate

  • How to calculate probabilities of multiple independent events occurring?
  • Model words as independent events
  • Multiply probabilities
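
A minimal sketch of that assumption, with hypothetical per-word probabilities:

  # Hypothetical P(word | spam) values learned from training data
  p_word_given_spam = {"free": 0.30, "offer": 0.20, "meeting": 0.02}

  def p_message_given_spam(words):
      """Treat words as independent events: multiply their probabilities."""
      p = 1.0
      for word in words:
          p *= p_word_given_spam.get(word, 1e-6)  # tiny default for unseen words
      return p

  print(p_message_given_spam(["free", "offer"]))    # ~0.06
  print(p_message_given_spam(["free", "meeting"]))  # ~0.006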

7.1 Naive    notes

  • Words are not independent
  • After "San", the word "Francisco" is much more likely
  • But it works surprisingly well in practice

8 Practical concerns    slide animate

  • What is the probability of a word we've never seen before?
  • Underflow: multiplying many small probabilities rounds the result to 0
  • Normalizing words: v1agra

8.1 Solutions    notes

  • A never-seen word would make the probability 0 (or cause a division by 0); instead, add 1 to all word counts
  • Use the log of probabilities to avoid underflow
  • Rules to normalize obfuscated spellings (e.g. v1agra)
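
A sketch of the first two fixes (add-one smoothing for unseen words, log probabilities against underflow); the counts are hypothetical:

  import math

  # Hypothetical word counts observed in spam training messages
  spam_word_counts = {"free": 120, "offer": 80}
  total_spam_words = 1000
  vocabulary_size = 5000

  def p_word_given_spam(word):
      """Add-one smoothing: unseen words get a small nonzero probability."""
      count = spam_word_counts.get(word, 0)
      return (count + 1) / (total_spam_words + vocabulary_size)

  def log_p_message_given_spam(words):
      """Sum log probabilities instead of multiplying, to avoid underflow."""
      return sum(math.log(p_word_given_spam(w)) for w in words)

  print(log_p_message_given_spam(["free", "offer", "zzzz"] * 100))  # stays finite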

9 Ensemble    slide

  • Using multiple models simultaneously
  • Run all classifiers over new data, take majority vote
  • Netflix Prize won with combination of models from several teams
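
A minimal sketch of majority voting over a few already-trained classifiers (the classifiers here are just placeholder functions):

  from collections import Counter

  # Placeholder classifiers: each maps a list of words to a label
  classifiers = [
      lambda words: "spam" if "free" in words else "ham",
      lambda words: "spam" if "offer" in words else "ham",
      lambda words: "spam" if len(words) > 5 else "ham",
  ]

  def majority_vote(example):
      """Run every classifier and return the most common prediction."""
      votes = Counter(clf(example) for clf in classifiers)
      return votes.most_common(1)[0][0]

  print(majority_vote(["free", "offer", "meeting"]))  # 'spam' by 2 votes to 1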

9.1 Requirements    notes

  • The nice thing is that the diversity of the models matters more than the accuracy of any single model

10 Bootstrap Aggregating    slide two_col

  • Bagging: training data sampled with replacement
  • Learn models on different samples
  • Run models on new incoming data

img/bagging.png
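
A rough sketch of bagging: draw bootstrap samples (with replacement) and train one model per sample; train_model is a hypothetical placeholder:

  import random

  def bootstrap_sample(examples, rng):
      """Sample len(examples) items with replacement."""
      return [rng.choice(examples) for _ in examples]

  def bag(examples, train_model, n_models=10, seed=0):
      """Train one model per bootstrap sample of the training data."""
      rng = random.Random(seed)
      return [train_model(bootstrap_sample(examples, rng))
              for _ in range(n_models)]

  # Toy usage: each "model" just remembers the mean of its sample
  models = bag(list(range(100)), lambda sample: sum(sample) / len(sample))
  print(models)  # the means vary because each sample differs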

10.1 Trade-offs    notes

11 Boosting    slide

  • Train classifier to catch what the last one missed
  • Train and test first classifier
  • Find classification failures
  • Weight more heavily those failures in training a new model
  • Weight models by their accuracy
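
A heavily simplified sketch of the reweighting step only (not a full boosting implementation; the doubling factor is an arbitrary choice):

  def reweight(examples, weights, model):
      """Increase the weight of examples the current model got wrong."""
      new_weights = []
      for (features, label), w in zip(examples, weights):
          wrong = model(features) != label
          new_weights.append(w * 2.0 if wrong else w)
      total = sum(new_weights)
      return [w / total for w in new_weights]  # renormalize to sum to 1

  examples = [([1], True), ([0], False), ([1], False)]
  weights = [1 / 3] * 3
  model = lambda features: features[0] == 1  # misclassifies the last example
  print(reweight(examples, weights, model))  # last weight grows to 0.5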

11.1 Trade-offs    notes

  • Boosting can be susceptible to outliers
  • Longer to train
  • Observed to be more accurate

12 Many Decision Trees    slide

  • Train trees with random selection of attributes, subset of data
  • Combine trees using majority or weights
  • What to call many arbitrarily picked trees?

12.1 Random Forests    slide two_col

img/green-forrest.jpg

  • Used successfully in many recent competitions
  • Carry over robustness properties from individual decision trees
  • Can be trained in parallel
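
A sketch of the two sources of randomness per tree (a bootstrap sample of the data and a random subset of attributes); train_tree is a hypothetical placeholder:

  import random

  def random_forest(examples, attributes, train_tree, n_trees=50, seed=0):
      """Train each tree on a bootstrap sample and a random attribute subset."""
      rng = random.Random(seed)
      n_attrs = max(1, int(len(attributes) ** 0.5))  # sqrt is a common heuristic
      forest = []
      for _ in range(n_trees):
          sample = [rng.choice(examples) for _ in examples]  # bootstrap sample
          attrs = rng.sample(attributes, n_attrs)            # attribute subset
          forest.append(train_tree(sample, attrs))
      return forest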

12.2 Parallel    notes

  • Potentially good fit for MapReduce paradigms

Date: 2013-02-22 14:04:03 PST

Author: Jim Blomo
