2013-04-26-Outliers

1 Outliers    slide

2 Generative Model    slide animate

  • "Real" model that produced original data points
  • Our mission is to reproduce the original model
  • Thus we have different techniques that can model different behavior

2.1 Questions    notes

  • What is a "generative model"?
  • What is data mining trying to discover? What is machine learning hoping to reproduce?
  • Why have different classifiers: decision tree, Naive Bayes, etc.?

3 Outliers    slide two_col

  • Significant deviation
  • Probably generated by a different model than the rest of the data
  • Normal / Abnormal

img/outlier.jpg

3.1 Intuitive    notes

4 Outlier Types    slide

Global
Points that deviate from the rest of the entire data set; also called point anomalies
Contextual
Points that deviate from their peers; also called conditional outliers
Collective
Points that deviate as a group, even though individual points may not be considered outliers

5 Which Type?    slide

  • Given class sizes at Berkeley:
    • A day with 10 people in class
    • A day with 7000 people in class
    • 3 weeks of 15 people in this class
  • Given Earth's temperatures:
    • A day at 100°C
    • 30 straight days of rain in Berkeley
    • A day at 100°F

6 Types of Learning    slide two_col

  • Supervised
  • Unsupervised
  • Semi-Supervised

img/ml-large-icon.png

6.1    notes

Supervised
Learning from "gold standard" labels
Unsupervised
Learning without labels
Semi-Supervised
Infer more labels from a few, then learn based on inferred + labeled data
img: https://www.coursera.org/course/ml

7 Outlier Methods    slide

Supervised
Label outliers, treat as classification problem
Unsupervised
Cluster data, find points not clustered well
Semi-Supervised
Manually label a few points, find nearby points to label automatically, then treat as classification
Statistical
Decide on a generative model / distribution, find points which have a low probability of belonging to it
Proximity
Use relative distance to neighbors

7.1 Features    notes

  • Some methods may overlap
  • When developing features for classification, relative features can be helpful: e.g., distance from the mean
  • E.g., agglomerative clustering: find lone/small groups that are the last to merge
  • E.g., k-means: find points which are "far" from the centroids (sketched below)
    • Determining "far" or "last" can be application specific, part of the challenge
  • What algorithm could we use to automatically label nearby points? k-nearest neighbors
  • Statistical: again, must define "low" in your domain
  • Proximity: basically translating features into another, relative, space, then applying a different type of outlier detection (e.g., statistical)
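
To make the clustering approach concrete, here is a minimal sketch of k-means-based outlier detection, assuming numpy and scikit-learn are available; the helper name and the 3x-median cutoff for "far" are our own placeholders for the application-specific choices mentioned above:

  import numpy as np
  from sklearn.cluster import KMeans

  def kmeans_outliers(X, k=3, factor=3.0):
      """Flag points that sit 'far' from their assigned centroid."""
      X = np.asarray(X, dtype=float)
      km = KMeans(n_clusters=k).fit(X)
      # Distance from each point to the centroid of its own cluster
      dists = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
      # "Far" is application specific; 3x the median is only a placeholder
      return dists > factor * np.median(dists)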

8 Statistical    slide

  • Assume a distribution
  • Determine parameters
  • Calculate the probability of a point being generated by the distribution

8.1 Why Statistical    notes

  • We've covered supervised methods and clustering, so let's skip to statistical methods
  • The most straightforward way is to use distributions

8.2 Statistical Example    slide two_col

  • Assume a normal distribution
  • Determine the mean and standard deviation
  • If |point - mean| / stddev > 3, consider it an outlier

img/gaussian-simple.png

8.2.1 Pros/Cons    notes

  • Straightforward (sketched below)
  • Can use percentages to motivate the cutoff intuitively (3 stddevs covers 99.7% of the data)
  • But must manually determine the cutoff
  • How do we know we got the parameters right?
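
A minimal sketch of the 3-standard-deviation rule, assuming numpy; the cutoff of 3 is exactly the manually chosen parameter the notes warn about:

  import numpy as np

  def zscore_outliers(data, cutoff=3.0):
      """Flag points more than `cutoff` sample stddevs from the mean."""
      data = np.asarray(data, dtype=float)
      z = np.abs(data - data.mean()) / data.std(ddof=1)
      return z > cutoff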

9 Grubbs' Test    slide

  • Takes sample size into account: the reliability of the mean/stddev estimates
  • Take the Z-score of the most extreme point; assign it to G
  • Student's t-distribution: used to model the distribution of the actual mean estimated from a sample

img/grubbs.png

9.1 Pros/Cons    notes

  • Z-score: abs(x - μ) / s
  • This isn't actually that different from measuring stddevs directly
  • But it accounts for sample size, and you can express your confidence with a significance level alpha, e.g., 0.05 for 95% confidence
  • Not going to go into the t-test/t-distribution here, but it basically helps show where the true mean likely is, given a set of sample data (see the sketch below)
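
A sketch of the two-sided Grubbs' test, assuming scipy for the t-distribution quantile: G is the largest Z-score in the sample, compared against a critical value that depends on the sample size N and the significance level alpha:

  import numpy as np
  from scipy import stats

  def grubbs_outlier(data, alpha=0.05):
      """True if the most extreme point is an outlier at level alpha."""
      data = np.asarray(data, dtype=float)
      N = len(data)
      # G: the largest Z-score in the sample
      G = np.max(np.abs(data - data.mean())) / data.std(ddof=1)
      # Critical value via the t-distribution with N-2 degrees of freedom
      t = stats.t.ppf(1 - alpha / (2 * N), N - 2)
      G_crit = (N - 1) / np.sqrt(N) * np.sqrt(t**2 / (N - 2 + t**2))
      return G > G_crit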

10 Outlier Distance    slide animate

  • How do we find outliers in more than one dimension?
  • Translate distances to one dimension, then find outliers there
  • How should we measure distance?

10.1 Limitations    notes

  • What are the limitations of the techniques we've seen?
  • Limited to one dimension! Taking the mean, stddev, etc. applies to one dimension
  • Euclidean distance doesn't take correlated (dependent) dimensions into account

11 Mahalanobis Distance    slide two_col

  • y depends somewhat on x
  • Euclidean distance weights all dimensions equally
  • Use the covariance matrix to normalize distances in each dimension
  • Matrix in which entry S_i,j is the covariance of dimensions i and j

img/GaussianScatterPCA.png

11.1 Mahalanobis    notes

  • How do we capture the intuition that a distance along the major axis is different from one along the minor axis?
  • Expand this drawing into 3 dimensions
  • Euclidean distance weights a point that sticks out in the z direction the same as one lying along the primary scatter direction

11.2 Mahalanobis Definition    slide

  • Find the mean vector
  • Normalize by the covariance matrix

img/mahalanobis.png

11.3 Some Math    notes

  • Some extra math tricks to make the units work out:
  • We're taking the squared distance, then taking the square root
  • The formula defines the squared Mahalanobis distance, D_M^2; taking the square root gives D_M
  • What happens if we have no covariance? S is the identity matrix, and D_M reduces to Euclidean distance (see the sketch below)
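
A sketch in numpy: compute the squared distance (x - mu)^T S^-1 (x - mu) per point, then take the square root; with S equal to the identity this reduces to Euclidean distance:

  import numpy as np

  def mahalanobis_distances(X):
      """Mahalanobis distance of each row of X from the mean vector."""
      X = np.asarray(X, dtype=float)
      mu = X.mean(axis=0)
      S_inv = np.linalg.inv(np.cov(X, rowvar=False))
      diff = X - mu
      # Squared distance (x - mu)^T S^-1 (x - mu) for each row, then sqrt
      return np.sqrt(np.einsum('ij,jk,ik->i', diff, S_inv, diff))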

12 Contextual Outliers    slide

  • Typically reduce scope to the context, then use global techniques (see the sketch below)
  • Example: calculate a normal distribution for Berkeley weather specifically
  • Collective outliers: find collections, use each as a context
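
A sketch of the reduce-to-context idea, assuming data arrives as (context, value) pairs, e.g. (month, temperature); within each context we simply reuse the global Z-score rule:

  import numpy as np
  from collections import defaultdict

  def contextual_outliers(records, cutoff=3.0):
      """records: (context, value) pairs; flag values unusual within context."""
      groups = defaultdict(list)
      for ctx, val in records:
          groups[ctx].append(float(val))
      flagged = []
      for ctx, vals in groups.items():
          if len(vals) < 2:
              continue  # not enough data to estimate parameters
          vals = np.asarray(vals)
          z = np.abs(vals - vals.mean()) / vals.std(ddof=1)
          flagged.extend((ctx, v) for v, zi in zip(vals, z) if zi > cutoff)
      return flagged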

13 Break    slide

Date: 2013-04-26 09:40:08 PDT

Author: Jim Blomo
