2013-03-07-Hierarchical

1 Hierarchical & Density
2 Methods
- 2.1 Algorithms
3 k-means Input
- 3.1 Difference
4 Agglomerative
- 4.1 Bottom-up
5 Cluster Distance
- 5.1 Meta Distance
6 Termination
- 6.1 Details
7 Dendrogram
- 7.1 Usefulness
8 CHAMELEON
- 8.1 Details
9 Results
- 9.1 Properties
10 Density: DBSCAN
- 10.1 Details
11 Density Trade-offs
- 11.1 Details
12 Algorithm Choice
- 12.1 Lessons
13 Elbow Method
14 Labels
- 14.1 Homogeneity
15 Break

1 Hierarchical & Density slide

2 Methods slide

Partitioning: Construct k groups, evaluate fitness, improve groups
Hierarchical: Agglomerate items into groups, creating "bottom-up" clusters; or divide set into ever smaller groups, creating "top-down" clusters
Density: Find groups by examining continuous density within a potential group
Grid: Chunk space into units, cluster units instead of individual records

2.1 Algorithms notes

Partitioning: k-means, k-medoid
Hierarchical: Build groups one "join" at a time: examine the distance between two things that could be joined and, if they are close, combine them. The reverse is divisive.
Density: Many of the above methods just look for distance. This method tries to find groups that might be strung out, but maintain a density. Think about an asteroid belt. It is one group, but not clustered together in a way you typically think.
Grid: Read the book

3 k-means Input slide animate

What did we supply to k-means before we ran it?
k: number of clusters
Clusters disjoint*
Hierarchical clustering builds up clusters incrementally

3.1 Difference notes

Hierarchical can find cluster of clusters
Can illustrate clusters at many levels, letting a human interpret what makes sense without guess-and-check
Clusters are built 1 cluster at a time, starting with all points being their own cluster
*We'll learn about "fuzzy" clustering next time, where cluster membership is a probability

4 Agglomerative slide

All points are separate clusters
Find closest clusters: Join them
Repeat

img/agglomerative.png

4.1 Bottom-up notes

Any questions about this?
What does "close" mean?
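A minimal sketch of the join-and-repeat loop, assuming one possible answer to "close": the Euclidean distance between the two nearest points of two clusters. The names agglomerate and single_link, the sample points and the stopping value k=3 are illustrative assumptions, not part of the slides.

```python
import math
from itertools import combinations

def single_link(a, b):
    """Distance between clusters a and b, taken as their two closest points."""
    return min(math.dist(p, q) for p in a for q in b)

def agglomerate(points, k):
    """Join the two closest clusters, repeatedly, until only k remain."""
    clusters = [[p] for p in points]  # all points start as separate clusters
    while len(clusters) > k:
        # find the pair of clusters with the smallest distance
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]  # join them
        del clusters[j]                          # j > i, so index i stays valid
    return clusters

points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
result = agglomerate(points, 3)
print(result)
```

This version recomputes every pairwise distance on every pass, which is fine for a handful of points but slow for real data.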

5 Cluster Distance slide

Minimum: Use the two closest points
Maximum: Use the two farthest points
Mean: Use the distance between the means (centroids) of the two clusters
Average: Sum of the distances of all pairs, divided by number of pairs

5.1 Meta Distance notes

These are actually distance metrics for clusters that translate down to distance metrics for points.
Still need to decide distance measures for points: Euclidean, Manhattan, etc. And that's just for numerical distance
Choose based on expected cluster topology, cross validation testing using human observers
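The four cluster distances can be sketched as functions of two lists of points. This assumes Euclidean distance between points and reads "mean" as the distance between the two cluster centroids. The function names and sample clusters are illustrative.

```python
import math
from itertools import product

def minimum(a, b):
    """Distance between the two closest points, one from each cluster."""
    return min(math.dist(p, q) for p, q in product(a, b))

def maximum(a, b):
    """Distance between the two farthest points, one from each cluster."""
    return max(math.dist(p, q) for p, q in product(a, b))

def centroid(cluster):
    """Coordinate-wise mean of a cluster's points."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))

def mean(a, b):
    """Distance between the means (centroids) of the two clusters."""
    return math.dist(centroid(a), centroid(b))

def average(a, b):
    """Sum of the distances of all cross-cluster pairs, divided by the pair count."""
    return sum(math.dist(p, q) for p, q in product(a, b)) / (len(a) * len(b))

a = [(0, 0), (0, 2)]
b = [(4, 0), (4, 2)]
for measure in (minimum, maximum, mean, average):
    print(measure.__name__, round(measure(a, b), 3))
```

Swapping math.dist for another point distance (Manhattan, for example) changes every measure at once, which is the sense in which cluster distances translate down to point distances.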

6 Termination slide

Reach a predefined number of clusters, k
Distance between clusters exceeds threshold
Fitness function for cluster

6.1 Details notes

If you wanted to look at all potential k, set k to 1, then look at subclusters
Distance or fitness function (e.g. density or minimum intra-cluster similarity score) can help define k automatically
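A sketch of two of these stopping rules side by side, assuming minimum (closest-point) cluster distance. The name agglomerate_until, the parameter max_distance and the threshold 2.0 are hypothetical choices for illustration.

```python
import math
from itertools import combinations

def closest_pair(clusters):
    """Return (distance, i, j) for the two closest clusters."""
    return min(
        (min(math.dist(p, q) for p in clusters[i] for q in clusters[j]), i, j)
        for i, j in combinations(range(len(clusters)), 2)
    )

def agglomerate_until(points, k=1, max_distance=math.inf):
    """Join clusters until k remain or the closest pair is too far apart."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        d, i, j = closest_pair(clusters)
        if d > max_distance:  # every remaining pair exceeds the threshold
            break
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters

points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
by_threshold = agglomerate_until(points, max_distance=2.0)
print(len(by_threshold), by_threshold)  # k falls out of the threshold
```

With the defaults (k=1, no threshold) the loop runs all the way to a single cluster, which corresponds to looking at every potential k.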

7 Dendrogram slide two_col

img/dendrogram1.jpg

Display of clustered groups
Concise visualization: groups do not need to be identified or named
Y axis can represent iteration

7.1 Usefulness notes

Can move up and down clustering to make sense of individual clusters
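Moving up and down the clustering can be sketched by recording every join and then "cutting" the record at a chosen height. This assumes minimum (closest-point) cluster distance, where join distances never decrease from one join to the next. The names linkage_history and cut are illustrative, and no plotting is attempted.

```python
import math
from itertools import combinations

def linkage_history(points):
    """Agglomerate down to one cluster, recording (join distance, merged cluster)."""
    clusters = [frozenset([p]) for p in points]
    history = []
    while len(clusters) > 1:
        d, a, b = min(
            ((min(math.dist(p, q) for p in a for q in b), a, b)
             for a, b in combinations(clusters, 2)),
            key=lambda t: t[0],
        )
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        history.append((d, a | b))
    return history

def cut(points, history, height):
    """Clusters that exist after applying every join at or below `height`."""
    clusters = [frozenset([p]) for p in points]
    for d, merged in history:
        if d > height:
            break
        # drop the pieces that were absorbed, keep the merged cluster
        clusters = [c for c in clusters if not c <= merged] + [merged]
    return clusters

points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
history = linkage_history(points)
for height in (0.5, 2, 7, 8):
    print(height, len(cut(points, history, height)))
```

Each entry in history corresponds to one horizontal bar of a dendrogram, with the join distance as its height.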

8 CHAMELEON slide two_col

Discover large number of small clusters
Group together small clusters
Join clusters whose interconnectedness with each other is high relative to the interconnectedness within each cluster

img/chameleon.png

8.1 Details notes

Mix of partition & agglomerative
Partition by finding groups of k-nearest neighbors: A, B in the same group if A is a k-nearest neighbor of B.
Interconnectedness measured by aggregate proximity in the group, or using a network model the book provides details on (10.3.4)
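A rough sketch of the grouping rule in the notes only: two points land in the same group if one is a k-nearest neighbor of the other, directly or through a chain of such links. This is not the full CHAMELEON algorithm (the interconnectedness-based joining is left out), and the name knn_groups and the sample data are assumptions.

```python
import math

def knn_groups(points, k):
    """Group points that are linked through k-nearest-neighbor relations."""
    parent = list(range(len(points)))  # union-find forest over point indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, p in enumerate(points):
        # indices of the k points closest to p, excluding p itself
        neighbors = sorted((j for j in range(len(points)) if j != i),
                           key=lambda j: math.dist(p, points[j]))[:k]
        for j in neighbors:
            parent[find(i)] = find(j)  # i and j belong to the same group

    groups = {}
    for i, p in enumerate(points):
        groups.setdefault(find(i), []).append(p)
    return list(groups.values())

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
groups = knn_groups(points, k=2)
print(groups)
```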

9 Results slide

img/chameleon-cluster.png

9.1 Properties notes

Tends to "follow" clusters as long as interconnectedness stays high

10 Density: DBSCAN slide two_col

Find "paths" of points that are in "dense" regions
Paths: points within a distance e
Density: surrounded by MinPts within region of radius e

img/density-connected.png

10.1 Details notes

Can find nonlinear "paths" to follow as long as they stay dense
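A compact sketch of DBSCAN under some assumptions: Euclidean distance, a point counts as its own neighbor when checking MinPts, and noise is labeled -1. The parameter eps plays the role of the slide's e. The function name, label convention and sample points are illustrative.

```python
import math

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    def region(i):
        # indices of all points within eps of point i (including i itself)
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region(i)
        if len(neighbors) < min_pts:
            labels[i] = -1  # not dense: noise, unless a cluster claims it later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region(j)
            if len(j_neighbors) >= min_pts:  # j is dense too: keep following the path
                queue.extend(j_neighbors)
    return labels

# a bent path of points plus one far-away outlier
points = [(0, 0), (1, 0), (2, 0), (3, 1), (4, 2), (20, 20)]
labels = dbscan(points, eps=1.5, min_pts=2)
print(labels)
```

The bent path ends up as one cluster because each point has a dense neighbor within eps, even though the two ends of the path are far apart.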

11 Density Trade-offs slide two_col

img/DBSCAN.png

Finds clusters of different sizes, shapes
DBSCAN is sensitive to the parameters used. How big is e? How many points is "dense"?

11.1 Details notes

img: http://en.wikipedia.org/wiki/DBSCAN

12 Algorithm Choice slide

Simple techniques often work surprisingly well
Choose other algorithms to tackle specific problems
Evaluation metrics

12.1 Lessons notes

Just like Naive Bayes, we make assumptions about our data that turn out to be right enough: clusters are uniformly sized, don't wander around our dimensioned space
Topic drift: tendency for a cluster to change its properties slowly over time, e.g. articles on politics might use different words from one year to the next
Performance: many of these algos are computationally expensive, hard to distribute. Book goes into run times and where to make compromises on the algo
Figure out a fitness function for your metric. If you used these clusters to take action, what would be the result?

13 Elbow Method slide two_col

Calculate intra-cluster variance
Compare to data set variance (F-test)
Find the point where the marginal gain in explanatory power drops off

img/elbow.JPG
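A sketch of the elbow idea on one-dimensional data, comparing intra-cluster variance to data set variance as a simple ratio rather than a formal F-test. The clusterings are hand-made for illustration, and the cutoff min_gain=0.05 is a hypothetical threshold for "the gain has dropped off".

```python
def sum_of_squares(values):
    """Sum of squared deviations from the mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def explained_variance(clusters):
    """Fraction of the data set's variance accounted for by the clustering."""
    everything = [v for c in clusters for v in c]
    within = sum(sum_of_squares(c) for c in clusters)
    return 1 - within / sum_of_squares(everything)

def elbow(explained_by_k, min_gain=0.05):
    """Smallest k after which one more cluster gains less than min_gain."""
    ks = sorted(explained_by_k)
    for k, next_k in zip(ks, ks[1:]):
        if explained_by_k[next_k] - explained_by_k[k] < min_gain:
            return k
    return ks[-1]

# hand-made clusterings of the same nine values for k = 1..4
clusterings = {
    1: [[1, 2, 3, 11, 12, 13, 21, 22, 23]],
    2: [[1, 2, 3, 11, 12, 13], [21, 22, 23]],
    3: [[1, 2, 3], [11, 12, 13], [21, 22, 23]],
    4: [[1, 2], [3], [11, 12, 13], [21, 22, 23]],
}
explained = {k: explained_variance(c) for k, c in clusterings.items()}
best_k = elbow(explained)
print({k: round(v, 4) for k, v in explained.items()}, best_k)
```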

14 Labels slide

Clustering is an example of unsupervised learning
But after clustering, humans can label clusters, and their contents
Now one can use homogeneity metrics to evaluate clusters

14.1 Homogeneity notes

Gini Index
Entropy
Precision / Recall
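Once humans have labeled the items, these metrics can be sketched per cluster. The label values and function names are illustrative, and precision/recall here assumes the cluster is treated as a retrieval result for a single target label.

```python
import math
from collections import Counter

def gini(labels):
    """Gini index: 0 for a pure cluster, larger when labels are mixed."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Entropy in bits: 0 for a pure cluster, 1 for an even two-way split."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def precision_recall(cluster_labels, all_labels, target):
    """Precision and recall of a cluster with respect to one target label."""
    hits = cluster_labels.count(target)
    return hits / len(cluster_labels), hits / all_labels.count(target)

pure = ["sports"] * 4
mixed = ["sports", "sports", "politics", "politics"]
everything = pure + mixed

print(gini(pure), gini(mixed))
print(entropy(mixed))
print(precision_recall(mixed, everything, "sports"))
```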

15 Break slide