2013-03-01-Neural-Network


1 Bias vs. Variance    slide

2 Trade-offs    slide

  • As with precision vs. recall, we make trade-offs when training models
  • Bias: How far off are the model predictions on average?
  • Variance: If we retrained with different data, how different would our guesses be?

2.1 Details    notes

  • Bias: difference between the expected (average) prediction of the models and the real value
  • Variance: how much the models' predictions differ from each other
  • Variance: another way to think about it: how specific is our model to our data? If we were training a tree with k-fold validation, would we get completely different rule sets for each set of data?
  • "Expected": these are properties of a model type. Train the model multiple times with different data, then evaluate all the models' performance, as in the sketch below

3 Regression    slide two_col

  • Can we do better than linear regression on some data sets?
  • Polynomial regression
  • How many polynomial terms?

img/overfit1.png

3.1 Polynomial    notes

  • Sure! Use a polynomial instead: y = 2x - x^2 + 4x^3
  • If you're not sure what the underlying data model is, you have to test, as in the sketch below
  • img: http://cheshmi.tumblr.com/
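A sketch of that testing idea (synthetic data, numpy assumed): fit polynomials of increasing degree and watch the training error fall even as the model starts memorizing the sample.

import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 12)
y = 2*x - x**2 + 4*x**3 + rng.normal(0, 0.2, 12)  # noisy samples of the cubic above

for deg in (1, 2, 3, 9):
    coeffs = np.polyfit(x, y, deg)
    mse = np.mean((np.polyval(coeffs, x) - y) ** 2)
    print(f"degree {deg}: training MSE {mse:.4f}")

# Training error keeps shrinking with degree, but a degree-9 curve through
# 12 points reflects the exact sample, not the general pattern.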

3.2 One    slide

img/overfit1.png

3.2.1 So-So    notes

  • How is the bias? Not great; there's a fair amount of error
  • How is the variance? Pretty good, assuming a random sample

3.3 Two    slide

img/overfit2.png

3.3.1 Better    notes

  • Bias? Better, less error
  • Variance? Riskier: depending on which samples you get, the model diverges quickly

3.4 Three    slide

img/overfit3.png

3.4.1 Worrying    notes

  • Now getting a little weird. We're not finding the general pattern; we're exactly fitting a line over these points
  • If we built the model with different data, we'd get a different line

3.5 Many    slide

img/overfit5.png

3.5.1 Now kind of ridiculous    notes

  • Intuitively we know this is not a description of the data
  • For a point near the border, the prediction is completely dependent on the data the model trained on

4 Over-fitting    slide

  • Over-fitting: reflecting the exact data given instead of the general pattern
  • High variance is a sign of over-fitting: model guesses vary with the exact data given
  • Avoidance: ensembles average out variance, regularization adds a cost to model complexity

4.1 Avoidance    notes

  • Ensembles combine multiple models together. Those models may each have a lot of variance, but as long as they have low bias, averaging them centers in on the correct result
  • Remember our cost function? We wanted to minimize the error. If you add a measure of model complexity to that cost, you explicitly trade off the complexity of your model against the quality of the solution; see the sketch below
  • If we wanted to add a complexity cost to the previous model, what would the cost be dependent on?
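One plausible answer for the polynomial models above: the degree, or the size of the coefficients. Here is a sketch of the regularization idea (numpy assumed; the quadratic penalty and the lam knob are one common choice, not the only one):

import numpy as np

def cost(w, X, y, lam):
    """Error plus an explicit price for model complexity."""
    error = np.sum((X @ w - y) ** 2)       # quality of the solution
    complexity = lam * np.sum(w ** 2)      # cost of model complexity
    return error + complexity

def fit_ridge(X, y, lam):
    """Closed-form minimizer of the cost above (ridge regression)."""
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

Larger lam drives the weights toward zero, trading a little extra error for a simpler, lower-variance model.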

5 Neural Networks    slide

img/neuron_culture.jpg

6 Brains    slide

  • Neural networks try to model our brains
  • Neurons/perceptrons sense input, transform it, send output
  • Neurons/perceptrons are connected together
  • Connections have different strengths

7 Training    slide

  • Learn by adjusting the strengths of the connections
  • Mathematically, strength is a weight multiplier of the output
  • When we've found the right weights, we've trained the model

8 Nomenclature    slide two_col

Input layer: neurons whose input is determined by features
Hidden layer: neurons that calculate a combination of features
Output layer: neurons that express the classification
Weights: numeric parameters that adjust input/output

img/nn.png

9 Handwriting    slide

  • Recognize handwritten digits

img/neuron11.gif

9.1 Inputs => Outputs    notes

10 Forward Propagation    slide

  1. Sum of inputs * weights
  2. Apply sigmoid
  3. Send output to next layer
  4. Repeat

10.1 Repeat    slide

  • Multiple hidden layers used to model complex feature interaction

img/2-layer-nn.gif

10.2 Sigmoid    slide two_col

  • Normalizes any input to (0,1)
  • Makes weak input weaker, strong input stronger
  • sigmoid(input) = 1 / (1 + e^(-input)); see the sketch below

img/sigmoid.png
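A minimal numpy sketch of the whole forward pass (the weights and inputs here are made-up values, not the ones in the slides):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward_layer(inputs, weights):
    """Sum of inputs * weights, then apply the sigmoid."""
    return sigmoid(weights @ inputs)

x = np.array([0.5, 0.9])          # input layer (features)
w_hidden = np.array([[0.8, 0.2],  # one row of weights per hidden neuron
                     [0.4, 0.9]])
w_output = np.array([[0.3, 0.5]])

# Repeat: outputs of one layer are the inputs to the next
hidden = forward_layer(x, w_hidden)
output = forward_layer(hidden, w_output)
print(output)                     # in (0,1); > 0.5 reads as "true"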

11 Example    slide

img/nn-fp1.png

11.1 Simple    notes

  • Simple NN with just one output
  • Output can model true/false
  • Inputs are numerical

11.2 Weights    slide

img/ann2.png

11.2.1 Later    notes

  • We'll discuss how weights are determined later
  • Fill in the hidden layer with the sum of inputs * weights

11.3 Sigmoid    slide

img/ann3.png

11.3.1 Apply    notes

  • Apply the sigmoid to the incoming signals

11.4 Sigmoid    slide

img/ann4.png

11.4.1 Apply    notes

  • Apply the sigmoid to the incoming signals

11.5 Sigmoid    slide

img/ann5.png

11.5.1 Apply    notes

  • Apply the sigmoid to the incoming signals

11.6 Sigmoid    slide

img/ann6.png

11.6.1 Apply    notes

  • Apply the sigmoid to the incoming signals

11.7 Weights    slide

img/ann7.png

11.7.1 Repeat    notes

  • Take the outputs, apply weights, sum

11.8 Sigmoid    slide

img/ann8.png

11.8.1 Apply    notes

  • Apply the sigmoid to the incoming signals
  • Our result is greater than 0.5, so we can assume true
  • If we had multiple outputs, we could choose the highest one

12 Forward Propagation    slide

  1. Sum of inputs * weights
  2. Apply sigmoid
  3. Send output to next layer
  4. Repeat

12.1 Get an answer    notes

  • Now we have an output, but how do we train to get the right output?

13 Fitness Function    slide

  • Create a fitness function that measures the error
  • Take the derivative and step in the direction that reduces the error
  • Try again

13.1 Neural Network    notes

  • NN training is conceptually similar to gradient descent
  • We want to get closer to the answer, so we adjust our weights based on the amount of incorrectness in the system
  • Adjust weights, try again, as in the toy sketch below
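A toy version of that loop on a one-dimensional error function (the function and step size are invented for illustration):

def error(w):
    return (w - 3.0) ** 2     # fitness function: squared error, minimized at w = 3

def d_error(w):
    return 2.0 * (w - 3.0)    # its derivative

w, step = 0.0, 0.1
for _ in range(50):
    w -= step * d_error(w)    # step in the direction that reduces the error
print(w)                      # close to 3.0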

14 Back Propagation    slide

Run forward: O_j is the output of node j
Calculate error of the output layer: Err_j = O_j(1-O_j)(T_j-O_j), where T_j is the target output
Calculate error of the hidden layers: Err_j = O_j(1-O_j)*sum_k(Err_k*w_jk)
Find new weights: w_ij = w_ij + l*Err_j*O_i
Repeat: to move closer to the correct weights

14.1 Derivative    notes

  • Derivative of the sigmoid is O_j(1-O_j), so we're taking the gradient
  • l is the learning rate, similar to a step size in gradient descent; the full update is sketched below
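The rules above written out in numpy, for a network stored as a list of float weight matrices (a sketch assuming sigmoid activations in every layer; shapes and names are illustrative):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def backprop_step(weights, x, target, l=0.5):
    """One forward pass plus one weight update."""
    # Run forward, remembering each layer's output O_j
    outputs = [x]
    for W in weights:
        outputs.append(sigmoid(W @ outputs[-1]))

    # Error of the output layer: Err_j = O_j(1-O_j)(T_j-O_j)
    o = outputs[-1]
    err = o * (1 - o) * (target - o)

    # Walk backwards through the layers
    for i in reversed(range(len(weights))):
        o_prev = outputs[i]
        # Hidden-layer error (uses the old weights): Err_j = O_j(1-O_j)*sum_k(Err_k*w_jk)
        err_prev = o_prev * (1 - o_prev) * (weights[i].T @ err)
        # New weights: w_ij = w_ij + l*Err_j*O_i
        weights[i] += l * np.outer(err, o_prev)
        err = err_prev
    return weights

Repeated calls with training examples move the weights closer to ones that produce the target outputs.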

15 Example    slide

img/ann8.png

15.1 Expected    notes

# Expected Output is 0
t_6 = 0
# Actual Output
o_6 = 0.8387
# Output Error -0.11346127339699999
err_6 = o_6*(1-o_6)*(t_6-o_6)
# Setup hidden node 5
o_5 = 0.9933
w_56 = 1.5
# Error for node 5 = -0.0011326458827956695
err_5 = o_5*(1-o_5)*(err_6*w_56)
# Adjust weight to 0.37298917134759924
l = 10  # learning rate
w_56 = w_56 + l*err_6*o_5

16 Terminate Learning    slide

  • Changes in weights are too small
  • Accuracy on the training data is high
  • Maximum number of iterations reached

16.1 Forward and Back    notes

  • Guess, correct, guess, correct
  • Stop when you've got a good model
  • or your model is not improving
  • or when you're out of time, as in the loop sketched below
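A training-loop skeleton with those three stopping conditions (the thresholds are arbitrary, and train_epoch/accuracy are hypothetical stand-ins for a real forward/back pass):

import random

def train_epoch():
    """Hypothetical stand-in: run one forward/back pass, return the largest weight change."""
    return random.random() * 1e-3

def accuracy():
    """Hypothetical stand-in: accuracy of the current model on the training data."""
    return random.random()

max_epochs, tol, target_acc = 1000, 1e-6, 0.99
for epoch in range(max_epochs):          # stop: maximum number of iterations
    if train_epoch() < tol:              # stop: changes in weights too small
        break
    if accuracy() >= target_acc:         # stop: accuracy is high enough
        break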

17 Break    slide

Date: 2013-03-04 23:20:39 PST

Author: Jim Blomo
