2013-03-15 Advanced Clustering

1 Advanced Clustering    slide

1.1 Questions from last week    notes

  • Heuristic for choosing # of clusters? sqrt(n/2)
  • k-means: how to calculate the centroid? The mean of the cluster's points
  • k-medoids: how to calculate the medoid? The cluster point with the smallest total distance to the other points

2 Review    slide

  • Clustering groups points by similarity
  • Build groups up, or break them down
  • Each point belongs to 1 cluster

2.1 Types    notes

  • Agglomerative, Divisive
  • Assign each point to 1 centroid (k-means)
  • or group clusters together, starting with every point as its own cluster (hierarchical)

3 Topics?    slide

  • "This place is great. I've been here for meetings, to get work done, to hang out with friends, and on dates, and it's fit the bill every time. In the summer, there's a wonderful patio, and in the winter it's cozy and warm. All the food I've had is delicious – especially the salad dressing!"

3.1 Model by topic    notes

  • ambiance
  • romantic
  • good for work
  • food quality
  • Many possible topics, and one review can touch several
  • So how do we model that?

4 Fuzzy Clusters    slide

  • Membership is a degree in [0, 1]
  • sum(membership(v, c) for c in clusters) == 1
  • Every point belongs to at least 1 cluster

4.1 Restrictions    notes

  • Every point still must be in a cluster
  • Think of the degree as a probability
  • Probabilities must add up to 100% (1)
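
A tiny concrete check of that restriction, in Python (the membership values here are invented for illustration):

  # Hypothetical membership degrees for one point across three clusters;
  # like probabilities, the degrees must sum to 1
  membership = {"A": 0.7, "B": 0.2, "C": 0.1}
  assert abs(sum(membership.values()) - 1.0) < 1e-9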

5 Generative Model    slide animate

  • The "real" model that produced the original data points
  • Our mission is to recover that original model
  • That's why we have different techniques: each can model different behavior

5.1 Questions    notes

  • What is a "generative model"?
  • What is data mining trying to discover? What is machine learning hoping to reproduce?
  • Why have different classifiers? Decision tree, Naive Bayes, etc?

6 HW: 1-D Clustering    slide animate

  • Imagine plotting the points on the number line
  • I generated these points with a process
  • Two Gaussian arrays, concatenated

6.1 Questions    notes

  • Draw number line
  • How did I come up with these numbers? I wanted two clusters.
  • What parameters did I use in the code to generate this specific set of N numbers? (A sketch follows.)
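
A minimal sketch of that generating process; the means, standard deviations, and counts below are placeholders, not the actual homework values:

  import numpy as np

  # Two Gaussian arrays, concatenated; all parameters are made up
  a = np.random.normal(loc=2.0, scale=0.5, size=10)   # cluster A
  b = np.random.normal(loc=8.0, scale=1.0, size=10)   # cluster B
  points = np.concatenate([a, b])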

7 Parameters    slide two_col

  • Mean
  • Standard Deviation
  • How many points to generate from each

img/gaussian-simple.png

7.1 Translation    notes

  • I had two distributions (mean, stddev)
  • I picked one or the other with a certain probability
  • Then generated a number from it
  • In reality, I just generated 10 from A and 10 from B, but you can think of that as a 50/50 choice (sketched below)
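
The same process viewed as a mixture, sketched with placeholder parameters:

  import random

  params = [(2.0, 0.5), (8.0, 1.0)]   # (mean, stddev) for A and B

  def sample():
      mean, stddev = random.choice(params)   # uniform pick = 50/50
      return random.gauss(mean, stddev)

  points = [sample() for _ in range(20)]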

8 Generative Model    slide

img/gaussian.png

8.1 3 Clusters    notes

  • We have all three parameters:
  • mean
  • stddev
  • probability of choosing each distribution (the height of its curve)

9 Best Fit?    slide two_col

img/gaussian-badfit.png

img/gaussian-goodfit.png

9.1 Choose    notes

  • These letters are points on our number line
  • Which is more likely to be generated by our real model?
  • But we don't know the generative model, so how do we discover it?
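
One way to compare candidate models is the log-likelihood of the observed points under each; a minimal sketch (the function names and the component format are mine):

  import math

  def gaussian_pdf(x, mean, stddev):
      return (math.exp(-(x - mean) ** 2 / (2 * stddev ** 2))
              / (stddev * math.sqrt(2 * math.pi)))

  def log_likelihood(points, components):
      # components: list of (probability, mean, stddev), one per distribution
      return sum(math.log(sum(pr * gaussian_pdf(x, m, s) for pr, m, s in components))
                 for x in points)

The better-fitting model scores higher, e.g. log_likelihood(points, [(0.5, 2.0, 0.5), (0.5, 8.0, 1.0)]).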

10 Revisit k-means    slide animate

  • Each object assigned to closest cluster
  • Reset center of the cluster to average
  • Repeat until steady

10.1 Questions    notes

  • What are the steps of k-means?
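
Those steps translate almost directly into code; a minimal 1-D sketch, assuming initial centers are given:

  def kmeans_1d(points, centers):
      while True:
          # Assign each point to its closest center
          assign = [min(range(len(centers)), key=lambda i: abs(p - centers[i]))
                    for p in points]
          # Reset each center to the average of its assigned points
          new_centers = []
          for i in range(len(centers)):
              mine = [p for p, a in zip(points, assign) if a == i]
              new_centers.append(sum(mine) / len(mine) if mine else centers[i])
          if new_centers == centers:   # steady: no center moved
              return centers, assign
          centers = new_centers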

11 Expectation-Maximization    slide

Expectation
Given current state, create a solution that fits our expectations
Maximization
Adjust the state to maximize the likelihood of the solution being true
Terminate
When further adjustments no longer change the state

11.1 k-means translation    notes

  • Our expectation in k-means is that points belong to the cluster closest to them
  • Our state or parameters for our model are the locations of the centers of those clusters
  • The maximization step therefore moves the centers to maximize the likelihood that they are the true centers

12 Revisit k-means    slide animate

  • Each object assigned to closest cluster
  • Reset center of the cluster to average
  • Repeat until steady

12.1 Questions    notes

  • What are the steps of k-means, restated as expectation and maximization?

13 Fuzzy Clustering    slide

  • Each object assigned to closest cluster → assigned a probability for each cluster
  • Reset center of the cluster to average → to the weighted average
  • Repeat until steady

13.1 Change    notes

  • The only difference here is that we're calculating the probability of a point belonging to a cluster
  • What should we base that probability on? Distance
  • If point A has a high probability of belonging to cluster C, what can you say about A and C? They are close

14 Distance    slide

  • (1 / dist(o,C)) / sum(1 / dist(o,c) for c in clusters)
  • Inverse distance to this cluster, normalized over all clusters, so closer clusters get higher weight

14.1 Similarity    notes

  • Our old friend distance

15 Distance2    slide

  • (1 / dist(o,C)**2) / sum(1 / dist(o,c)**2 for c in clusters)
  • Inverse squared distance to this cluster, normalized by the sum over all clusters

15.1 Similarity    notes

  • Our old friend distance, inverted so that near means high weight
  • And squared, to sharpen the contrast between near and far clusters

16 Reset Center    slide

  • sum(weight[c][p]**2 * p for p in points) / sum(weight[c][p]**2 for p in points)
  • Squared weight of each point in this cluster times the point's coordinates, normalized by the sum of squared weights: a weighted average

16.1 Weighting by distance    notes

  • Distance of point affects how much a cluster center is pulled toward it
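
Putting the last three slides together, one round of fuzzy clustering in 1-D might look like this sketch (the function name and the epsilon guard are mine):

  def fuzzy_step(points, centers):
      eps = 1e-12   # guard for a point sitting exactly on a center
      # Weight of each point in each cluster: inverse squared distance, normalized
      weights = []
      for p in points:
          inv = [1.0 / ((p - c) ** 2 + eps) for c in centers]
          total = sum(inv)
          weights.append([w / total for w in inv])
      # Reset each center to the weighted average, using squared weights
      new_centers = []
      for j in range(len(centers)):
          num = sum(weights[i][j] ** 2 * points[i] for i in range(len(points)))
          den = sum(weights[i][j] ** 2 for i in range(len(points)))
          new_centers.append(num / den)
      return new_centers, weights

Repeating fuzzy_step until the centers stop moving (or move less than some small threshold) gives the "repeat until steady" loop from the earlier slide.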

17 Stability    slide

  • When our centroids stabilize, we can estimate the parameters of our distributions
  • Or use probabilities of points directly

17.1 Uses    notes

  • Sometimes you may not need the original parameters; when you do, they can be estimated from the memberships (sketched below)
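
A sketch of recovering one cluster's distribution parameters from the stabilized memberships (the function name is mine):

  def estimate_gaussian(points, w):
      # w[i] is point i's membership degree in this cluster
      total = sum(w)
      mean = sum(wi * p for wi, p in zip(w, points)) / total
      variance = sum(wi * (p - mean) ** 2 for wi, p in zip(w, points)) / total
      mixing = total / len(points)   # probability of choosing this distribution
      return mean, variance ** 0.5, mixing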

18 Break    slide

Date: 2013-03-15 13:43:10 PDT

Author: Jim Blomo
