Construct k groups, evaluate fitness, improve groups
Agglomerate items into groups, creating "bottom-up" clusters; or divide set into ever smaller groups, creating "top-down" clusters
Find groups by examining continuous density within a potential group
Chunk space into units, cluster units instead of individual records

2.1 Algorithms    notes

k-means, k-medoid
Build groups one "join" at a time: examine the distance between two groups that could be joined and, if they are close, combine them. The reverse approach is divisive.
Many of the above methods look only at distance. Density-based methods try to find groups that may be strung out but maintain a consistent density. Think of an asteroid belt: it is one group, but not clustered together in the way you would typically expect.
3 k-means Input    slide animate

  • What did we supply to k-means before we ran it?
  • k: number of clusters
  • Clusters disjoint*
  • Hierarchical clustering builds up clusters incrementally

3.1 Difference    notes

  • Hierarchical can find cluster of clusters
  • Can illustrate clusters at many levels, letting a human interpret what makes sense without guess-and-check
  • Clusters are built one join at a time, starting with every point as its own cluster
  • *We'll learn about "fuzzy" clustering next time, where cluster membership is a probability

4 Agglomerative    slide

  • All points are separate clusters
  • Find closest clusters: Join them
  • Repeat


4.1 Bottom-up    notes

  • What does "close" mean?

5 Cluster Distance    slide

Single link: use the two closest points
Complete link: use the two farthest points
Centroid: use the distance between the means of the two clusters
Average link: sum of the distances of all cross-cluster pairs, divided by the number of pairs

5.1 Meta Distance    notes

  • These are cluster-level distance measures, each built on top of a point-level distance measure.
  • Still need to decide on a distance measure for points: Euclidean, Manhattan, etc. And that only covers numerical data
  • Choose based on the expected cluster topology, or cross-validate the results with human observers
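
The four cluster distances on the slide can be sketched as small functions layered over a point distance. This is a minimal illustration, not a library API: the function names (`single_link`, `complete_link`, `centroid_link`, `average_link`) and the toy clusters `a` and `b` are assumptions made up for this example.

```python
import math
from itertools import product

def euclidean(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

def manhattan(p, q):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(p, q))

def single_link(c1, c2, dist=euclidean):
    """Distance between the two closest points, one from each cluster."""
    return min(dist(p, q) for p, q in product(c1, c2))

def complete_link(c1, c2, dist=euclidean):
    """Distance between the two farthest points, one from each cluster."""
    return max(dist(p, q) for p, q in product(c1, c2))

def centroid(cluster):
    """Coordinate-wise mean of a cluster."""
    n = len(cluster)
    return tuple(sum(axis) / n for axis in zip(*cluster))

def centroid_link(c1, c2, dist=euclidean):
    """Distance between the means of the two clusters."""
    return dist(centroid(c1), centroid(c2))

def average_link(c1, c2, dist=euclidean):
    """Sum of all cross-cluster pair distances, divided by the number of pairs."""
    return sum(dist(p, q) for p, q in product(c1, c2)) / (len(c1) * len(c2))

# Hypothetical toy clusters, for illustration only.
a = [(0.0, 0.0), (0.0, 2.0)]
b = [(3.0, 0.0), (5.0, 0.0)]

for link in (single_link, complete_link, centroid_link, average_link):
    print(link.__name__, round(link(a, b), 3))
```

Passing `dist=manhattan` swaps the point-level measure without touching the cluster-level one, which is the layering the notes describe.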

6 Termination    slide

  • Predefined k: stop when k clusters remain
  • Distance between clusters exceeds threshold
  • Fitness function for cluster

6.1 Details    notes

  • If you want to look at every potential k, set k to 1 so joining runs to completion, then look at the sub-clusters
  • A distance or fitness function (e.g. density or a minimum intra-cluster similarity score) can help define k automatically
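
A naive sketch of the agglomerative loop with two of the stopping rules above: a target k and a distance threshold. It assumes single-link distance and Euclidean points; the function name `agglomerate`, its parameters, and the sample `points` are hypothetical, and the quadratic search for the closest pair is chosen for readability rather than speed.

```python
import math
from itertools import combinations

def single_link(c1, c2):
    """Distance between the two closest points, one from each cluster."""
    return min(math.dist(p, q) for p in c1 for q in c2)

def agglomerate(points, k=1, max_distance=math.inf, linkage=single_link):
    """Join the closest pair of clusters until a stopping rule fires."""
    clusters = [[p] for p in points]  # every point starts as its own cluster
    while len(clusters) > max(k, 1):
        # Find the indices of the closest pair of clusters.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        # Threshold rule: stop once even the closest pair is too far apart.
        if linkage(clusters[i], clusters[j]) > max_distance:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters

# Hypothetical sample data: two tight pairs and one distant point.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]

by_k = agglomerate(points, k=2)                      # stop at two clusters
by_threshold = agglomerate(points, max_distance=2.0)  # stop when joins get long
```

With the threshold rule the number of clusters falls out of the data instead of being chosen up front, which is the sense in which a distance or fitness function can define k automatically.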

7 Dendrogram    slide two_col


  • Display of clustered groups
  • Concise visualization: groups do not need to be identified or named
  • Y axis can represent iteration

7.1 Usefulness    notes

  • Can move up and down the hierarchy to make sense of individual clusters
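
One way to picture moving up and down a dendrogram is to record every join with the distance at which it happened, then "cut" at a chosen height. The sketch below assumes single-link distance (whose join distances never decrease, so replaying joins below a height is valid) and distinct points; `merge_history`, `cut`, and the sample `points` are hypothetical names for this illustration.

```python
import math
from itertools import combinations

def single_link(c1, c2):
    """Distance between the two closest points, one from each cluster."""
    return min(math.dist(p, q) for p in c1 for q in c2)

def merge_history(points):
    """Agglomerate down to one cluster, recording every join."""
    clusters = [frozenset([p]) for p in points]
    history = []  # entries are (join distance, left cluster, right cluster)
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda ab: single_link(*ab))
        history.append((single_link(a, b), a, b))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return history

def cut(points, history, height):
    """Replay the joins made at or below `height`: one horizontal cut."""
    clusters = {frozenset([p]) for p in points}
    for distance, a, b in history:
        if distance <= height:
            clusters -= {a, b}
            clusters.add(a | b)
    return clusters

# Hypothetical sample data: two tight pairs and one distant point.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
history = merge_history(points)

for height in (0.5, 2, 10, 25):
    print(height, len(cut(points, history, height)), "clusters")
```

Here the vertical axis is the join distance; using the position of each entry in `history` instead would give the "iteration" axis mentioned on the slide.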

8 CHAMELEON    slide two_col

  • Discover large number of small clusters
  • Group together small clusters
  • Join clusters whose interconnectedness with each other is high relative to the interconnectedness within each cluster


8.1 Details    notes

  • Mix of partition & agglomerative
  • Partition by finding groups of k-nearest neighbors: A, B in the same group if A is a k-nearest neighbor of B.
  • Interconnectedness is measured by aggregate proximity in the group, or by a network model; the book provides details (10.3.4)
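
The k-nearest-neighbor grouping in the notes can be sketched as a graph: link A and B whenever one is among the other's k nearest neighbors, then read off the connected groups. This is only a simplified stand-in for the first phase; it does not implement the merging by interconnectedness, and taking connected components is an assumption of this sketch, not the book's partitioning procedure. The name `knn_groups` and the sample `points` are hypothetical.

```python
import math

def knn_groups(points, k):
    """Group point indices that are connected through k-nearest-neighbor links."""
    n = len(points)
    # Undirected edge i-j whenever j is one of the k nearest neighbors of i.
    adjacent = {i: set() for i in range(n)}
    for i in range(n):
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(points[i], points[j]),
        )
        for j in others[:k]:
            adjacent[i].add(j)
            adjacent[j].add(i)

    # Collect connected components with a depth-first walk.
    seen, groups = set(), []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack, group = [start], []
        while stack:
            i = stack.pop()
            group.append(i)
            for j in adjacent[i] - seen:
                seen.add(j)
                stack.append(j)
        groups.append(sorted(group))
    return groups

# Hypothetical sample data: two well-separated trios.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
groups = knn_groups(points, k=2)
```

A small k yields many small groups, which matches the "large number of small clusters" starting point on the slide.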

9 Results    slide


9.1 Properties    notes

  • Tends to "follow" clusters as long as interconnectedness stays high

10 Density: DBSCAN    slide two_col

  • Find "paths" of points that are in "dense" regions
  • Paths: points within a distance e (epsilon) of each other
  • Density: a point is surrounded by at least MinPts points within a region of radius e


10.1 Details    notes

  • Can find nonlinear "paths" to follow as long as they stay dense
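
A minimal sketch of the idea, assuming Euclidean distance and a brute-force neighbor search. The parameter names `eps` and `min_pts` stand in for e and MinPts, a point counts itself among its neighbors, and the sample `points` are hypothetical. Treat it as an illustration of following dense paths, not a tuned implementation.

```python
import math

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    def neighbors(i):
        # Every point within eps of point i, including i itself.
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # not dense; may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins, but does not extend the path
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # dense point: keep following the path
    return labels

# Hypothetical sample data: a strung-out chain plus one far-away point.
points = [(x, 0) for x in range(6)] + [(10, 10)]
labels = dbscan(points, eps=1.1, min_pts=3)
```

The chain ends up as a single cluster even though its two ends are far apart, because each step along it stays dense; the isolated point is labeled noise.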

11 Density Trade-offs    slide two_col


  • Finds clusters of different sizes, shapes
  • DBSCAN is sensitive to the parameters used. How big is e? How many points is "dense"?

11.1 Details    notes

12 Algorithm Choice    slide

  • Simple techniques often work surprisingly well
  • Choose other algorithms to tackle specific problems
  • Evaluation metrics

12.1 Lessons    notes

  • Just as with Naive Bayes, we make assumptions about our data that turn out to be right enough: clusters are uniformly sized and don't wander around our dimensional space
  • Topic drift: the tendency for a cluster to change its properties slowly over time, e.g. articles on politics might use different words
  • Performance: many of these algorithms are computationally expensive and hard to distribute. The book goes into run times and where to make compromises on the algorithm
  • Figure out a fitness function for your metric. If you used these clusters to take action, what would be the result?

13 Elbow Method    slide two_col

  • Calculate intra-cluster variance
  • Compare to data set variance (F-test)
  • Find the point where the marginal gain in explanatory power drops off
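
The first and last steps can be sketched as follows. The example compares intra-cluster sum of squares to the whole data set's sum of squares and picks the smallest k after which one more cluster adds little; it does not perform the F-test. The `min_gain` cutoff, the function names, and the hand-made `partitions` (which are not the output of any clustering run) are all assumptions for illustration.

```python
def sum_sq(points):
    """Sum of squared distances from each point to the group's mean."""
    n = len(points)
    mean = [sum(axis) / n for axis in zip(*points)]
    return sum((x - m) ** 2 for p in points for x, m in zip(p, mean))

def explained(clusters):
    """Share of the data set's variance accounted for by the clustering."""
    everything = [p for c in clusters for p in c]
    return 1 - sum(sum_sq(c) for c in clusters) / sum_sq(everything)

def elbow(explained_by_k, min_gain=0.05):
    """Smallest k where adding one more cluster gains less than `min_gain`."""
    ks = sorted(explained_by_k)
    for k, next_k in zip(ks, ks[1:]):
        if explained_by_k[next_k] - explained_by_k[k] < min_gain:
            return k
    return ks[-1]

# Hand-made partitions of six 1-D points (hypothetical data).
partitions = {
    1: [[(0,), (1,), (10,), (11,), (20,), (21,)]],
    2: [[(0,), (1,), (10,), (11,)], [(20,), (21,)]],
    3: [[(0,), (1,)], [(10,), (11,)], [(20,), (21,)]],
    4: [[(0,)], [(1,)], [(10,), (11,)], [(20,), (21,)]],
}
scores = {k: explained(clusters) for k, clusters in partitions.items()}
best_k = elbow(scores)  # 3 for this data with the default min_gain
```

Going from three to four clusters here barely changes the explained share, so k = 3 is where the curve bends.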


14 Labels    slide

  • Clustering is an example of unsupervised learning
  • But after clustering, humans can label clusters, and their contents
  • Now one can use homogeneity metrics to evaluate clusters

14.1 Homogeneity    notes

  • Gini Index
  • Entropy
  • Precision / Recall
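
Once humans have labeled the items, each metric above can be computed per cluster from the labels it contains. This is a minimal sketch; the function names and the example labels are hypothetical, and precision/recall here treats a cluster as the "retrieved" set for one chosen class.

```python
import math
from collections import Counter

def gini(labels):
    """Gini index of a cluster's labels: 0 when every item shares one label."""
    n = len(labels)
    return 1 - sum((count / n) ** 2 for count in Counter(labels).values())

def entropy(labels):
    """Entropy in bits of a cluster's labels: 0 when every item shares one label."""
    n = len(labels)
    return -sum((count / n) * math.log2(count / n) for count in Counter(labels).values())

def precision_recall(cluster_labels, all_labels, target):
    """Precision and recall of a cluster with respect to the class `target`."""
    hits = cluster_labels.count(target)
    return hits / len(cluster_labels), hits / all_labels.count(target)

# Hypothetical human-assigned labels for one cluster and for the whole data set.
cluster = ["politics", "politics", "politics", "sports"]
everything = cluster + ["politics", "sports", "sports"]

print(gini(cluster), entropy(cluster))
print(precision_recall(cluster, everything, "politics"))
```

Lower Gini and entropy mean a more homogeneous cluster, while precision and recall also account for items of the class that landed in other clusters.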

