2013-03-01-SVM

1 Linear Regression    slide

2 Types of Models    slide animate

  • Classifiers
  • Regressions
  • Clustering
  • Outlier

2.1 Details    notes

Classifiers
Describe and distinguish cases, e.g., Yelp may want to find a category for a business based on its reviews and business description
Regressions
Predict a continuous value, e.g., predict a home's selling price given square footage and number of bedrooms
Clustering
Find "natural" groups of data without labels
Outlier
Find anomalous data points, e.g., detecting credit card fraud

3 Case Study    slide

  • Housing prices: square footage

img/housing-regression.gif

3.1 Problem    notes

  • We'd like to know how to price a house based on the square footage
  • Let's pretend this is the data we have
  • How would we guess the price for 2500 sq ft?

4 Solution?    slide animate

  • Find a line that represents the data
  • y = m*x + b
  • A line that is not very far from the points

4.1 Prompts    notes

  • In English, how would you solve this?
  • How to mathematically represent the line?
  • What is a good line?

5 Similarity    slide

  • One of the main challenges in data mining: defining a precise metric for an intuition
  • Define distance for an individual point
  • Define how to aggregate distances together

5.1 Challenge    notes

  • This is a big problem for engineering and math (stats) in general
  • We'll cover some concepts, but if you're ever stuck, try looking in related fields
  • What are some of the ways we can measure distance between points? Euclidean, Manhattan; Euclidean == L2 norm
  • What is a way to aggregate numbers? Sum, sum of squares, sum of logs (sketched below)
  • Differences between the last two?
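
A minimal sketch of these distance and aggregation options in Python; the sample values are invented for illustration:

    import math

    def euclidean(p, q):    # L2 norm: straight-line distance
        return math.sqrt(sum((a - b)**2 for a, b in zip(p, q)))

    def manhattan(p, q):    # L1 norm: sum of per-axis differences
        return sum(abs(a - b) for a, b in zip(p, q))

    print(euclidean((0, 0), (3, 4)), manhattan((0, 0), (3, 4)))   # 5.0 7

    diffs = [3.0, -4.0, 1.0]
    print(sum(diffs))                            # plain sum: signs can cancel
    print(sum(d**2 for d in diffs))              # sum of squares: always non-negative
    print(sum(math.log(abs(d)) for d in diffs))  # sum of logs: damps large values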

5.2 Log & Square    slide two_col

Log
Useful for de-emphasizing large raw differences
Square
Useful as a smooth stand-in for the absolute value: squaring removes the sign

img/logx.gif

6 Point Distance    slide two_col

  • y distance from line
  • Intuitively: error in estimate
  • h(x) = m*x + b
  • err = h(x) - y

img/error.gif

6.1 Error    notes

  • We want the difference between what we estimate the value to be and what the value actually is

7 Aggregate    slide animate

  • Sum
  • What about negative error?
  • Sum of squares
  • err = sum( (h(x) - y)**2 for x,y in dataset) / len(dataset)
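
The same aggregate spelled out as a runnable sketch; the toy dataset and the default slope/intercept below are invented:

    def h(x, m=1.0, b=0.0):    # our line: h(x) = m*x + b
        return m * x + b

    dataset = [(1.0, 1.2), (2.0, 1.9), (3.0, 3.1)]   # toy (x, y) pairs
    err = sum((h(x) - y)**2 for x, y in dataset) / len(dataset)
    print(err)   # mean squared error, normalized by the number of points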

7.1 Questions    notes

  • Now that we have the error for every point, how do we summarize?
  • Some points have negative error, some positive. Do they cancel each other out?
  • Imagine a data set of two points: one solution's line passes through both points, another passes between them so the errors cancel. Which is better?
  • Use our squaring trick to make sure we don't have any negative values
  • Normalize by the number of points

8 Fitness Function    slide

  • Measures the quality or cost of the solution
  • Key ingredient for data mining algorithms
  • If you can measure it, you can find the best solution

8.1 Fitness    notes

  • The function spits out a metric, which can be thought of as fitness or cost
  • Find the maximum or minimum of that metric
  • Depending on your fitness function, this can be easy or difficult
  • img: http://onlinestatbook.com

9 Understanding Error    slide

img/Linear_regression.svg.png
Several possible solutions

9.1 Error    notes

  • What happens to the error as we move line around?
  • Decreases until best fit, then increases
  • What happens if we plot this error? Say, slope (x) against error (y)?

10 Solution as Minimization    slide two_col

  • Error as a function of slope is a parabola (sketched after the figure)
  • Several methods for finding the minimum
  • Two categories: analytical and approximate

img/parabola.png
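
To see the parabola, sweep the slope m over a range (holding b fixed) and record the error at each value; a small sketch with invented data:

    dataset = [(1.0, 1.1), (2.0, 2.0), (3.0, 3.2)]   # toy (x, y) pairs
    b = 0.0
    for m in [0.0, 0.5, 1.0, 1.5, 2.0]:
        err = sum((m * x + b - y)**2 for x, y in dataset) / len(dataset)
        print(m, err)    # error falls toward the best-fit slope, then rises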

11 Solution Approximation    slide

  • Some fitness functions can be difficult to solve analytically
  • Alternative: iteratively get closer to the solution
  • Stop when answer is close enough

11.1 Analytical    notes

  • How to find the minimum of functions in general?
  • Take the derivative, set it equal to 0 (worked through below)
  • Taking the derivative can be complex or impossible (discontinuities) for some functions, or solving for 0 can be difficult
  • Instead, we'll keep getting closer to the minimum using the function we already have
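
For simple linear regression the derivative-equals-zero approach gives a well-known closed form: m = cov(x, y) / var(x) and b = mean(y) - m * mean(x). A minimal sketch with invented data:

    xs = [1.0, 2.0, 3.0, 4.0]
    ys = [1.2, 1.9, 3.1, 3.9]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    m = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx)**2 for x in xs)
    b = my - m * mx
    print(m, b)    # slope and intercept that minimize the squared error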

12 Gradient Descent    slide two_col

  1. Estimate current gradient (derivative)
  2. Take a step (a * deriv) against the gradient, i.e., downhill
  3. If the step is small, stop; else repeat.

img/parabola.png

12.1 Steps    notes

  • Take the gradient by looking at the local derivative, or by perturbing x (see the sketch after this list)
  • Choose a as the step-size weight: a big a means a large step
  • If the derivative is large, it will also make your step size large
  • If the derivative is large, it probably means you are far from the minimum
  • Keep repeating
  • What happens if a is too small?
  • What happens if a is too big?
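
A minimal gradient descent sketch for one dimension, estimating the derivative by perturbing x; the target function, starting point, step weight a, and stopping tolerance are all arbitrary choices for illustration:

    def f(x):                        # the parabola we want to minimize
        return (x - 3.0)**2

    def deriv(f, x, eps=1e-6):       # estimate the gradient by perturbing x
        return (f(x + eps) - f(x - eps)) / (2 * eps)

    x, a = 0.0, 0.1                  # starting guess and step-size weight
    while True:
        step = a * deriv(f, x)       # big derivative => big step
        if abs(step) < 1e-8:         # step is small: stop
            break
        x -= step                    # move against the gradient (downhill)
    print(x)                         # converges near the minimum at 3.0

If a is too small, convergence is slow; if a is too big (here, above 1.0), the steps overshoot and the loop diverges.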

13 General Case    slide two_col

  • Formulate fitness function for your problem
  • Use analytical methods or approximations to find the min/max
  • Approximations: Newton's Method, Gradient Descent

img/error-reduce.png

13.1 Approximate visualization    notes

  • Desired output of the error as gradient descent runs
  • Maybe some local bumps if the step size is too big, but the error slowly moves down to a small amount

14 Support Vector Machines    slide

15 Decision Trees    slide two_col

  • Great for separable attributes
  • Rules operate on independent attributes
  • Classes separable along an axis/attribute

img/tree.png

15.1 Linearly Separable    slide

  • How to handle the case where the separating line is not aligned with an axis?

img/dataset_linsep.png

15.2 Details    notes

16 Possibilities    slide

  • Many lines could separate these classes

img/dataset_linsep.png

16.1 Best?    notes

  • Which is the best?
  • Why?

16.2 Best Separator    slide two_col

  • Best line gives the most distance between the two classes
  • Measure distance between closest points
  • Closest points == support vectors

img/separable.jpg

16.3 Points, Vectors    notes

17 Dimensions    slide

  • When separating data in two dimensions, we need a line
  • When separating 3 dimensions?
  • 4 dimensions?

17.1 Vocabulary    notes

  • Plane
  • Hyperplane

18 Expressing the Hyperplane    slide animate

  • y = m*x + b
  • x_2 = m*x_1 + b
  • 0 = m*x_1 + b - x_2
  • 0 = [m, -1] * [x_1, x_2] + b    (dot product)
  • 0 = w * x + b
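
With this form, classifying a point is just checking which side of the hyperplane it falls on, i.e., the sign of w * x + b. A small sketch with invented weights and points:

    w, b = [2.0, -1.0], 1.0          # the [m, -1] vector from above, plus offset

    def side(x):                     # sign of w . x + b picks the side
        s = sum(wi * xi for wi, xi in zip(w, x)) + b
        return 1 if s > 0 else -1

    print(side([0.0, 0.0]), side([0.0, 5.0]))   # opposite sides: 1 -1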

18.1 Questions    notes

  • How do you mathematically represent a line?
  • Now, we're not going to think of a new letter for every dimension, we're just going to say x_1, x_2, x_3
  • Rewrite mathematically
  • How to add more dimensions? x_22? Express x as a vector of all attributes
  • Again, we don't want to come up with a bunch more letters after m, so use w as the vector representing all the slopes m

19 Challenge    slide two_col

  • Find w, b such that the hyperplane w * x + b = 0 maximizes the distance between the support vectors (distance formula sketched below)

img/svm.png
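
For intuition: the distance from a point x to the hyperplane w * x + b = 0 is |w * x + b| / ||w||, so the margin is controlled by the size of w, and maximizing the margin amounts to minimizing ||w||. A small sketch with invented values:

    import math

    w, b = [3.0, 4.0], -5.0
    x = [2.0, 1.0]
    norm_w = math.sqrt(sum(wi**2 for wi in w))                     # ||w||
    dist = abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm_w
    print(dist)    # distance from x to the hyperplane: 1.0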

20 Maximizing Fitness Function    slide two_col

  • Now we have a fitness function and parameters we're trying to optimize
  • Sound familiar?

img/svm.png

21 Kernel Tricks    slide two_col

  • SVMs are good for linearly separable data
  • How to handle other data?

img/svm-circular.jpg

21.1 Polynomial Kernel    slide

  • Transform the data so that it is linearly separable
  • What function can we apply to these data points to make them separable?

img/svm-circular.jpg

21.1.1 Square    notes

  • Square all of the coordinates: points inside the circle stay near the origin, points outside move far away

21.2 Polynomial Kernel    slide

img/kernel-trick.jpg

Now apply SVM
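
A hedged sketch of the whole trick, assuming scikit-learn is available (the toy points are invented): square each coordinate, then fit a linear SVM in the transformed space.

    from sklearn.svm import SVC

    # inner-circle points labeled 0, outer points labeled 1
    X = [[0.1, 0.2], [-0.2, 0.1], [1.5, 1.5], [-1.5, 1.4]]
    y = [0, 0, 1, 1]

    X_sq = [[x1**2, x2**2] for x1, x2 in X]     # square each coordinate
    clf = SVC(kernel='linear').fit(X_sq, y)     # a straight line now separates them
    print(clf.predict([[0.0, 0.01]]))           # transformed inner point => class 0

In practice the polynomial kernel (kernel='poly') performs this kind of transform implicitly, without materializing the squared features.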

22 Break    slide

Date: 2013-03-15 10:19:25 PDT

Author: Jim Blomo
