2013-04-05-Frequent-Pattern
Table of Contents
1 Frequent Patterns
2 Finding Patterns
- Cookies frequently purchased with milk
- Website signups frequently occurring after reading FAQ
- DNA sections frequently seen with a drug reaction
2.1 Patterns notes
- set of items
- subsequences of actions
- substructures
- Generalized to any kind of pattern that occurs "frequently" in the dataset
3 Market Basket
- What things are frequently purchased together?
- Apocryphal example: beer and diapers
- Can be used for any natural grouping
3.1 Details notes
- Example of how patterns are discovered is to look at groups of actions
- One natural group is the shopping basket: what items are in it?
- But can also be applied to anytime there is a natural grouping
- Eg. web session logs group naturally around a person and time window
4 Define "Frequently"
- Action
A
andB
- Support
- probability that a transaction contains
A ∪ B
- Confidence
- conditional probability that a transaction having
A
also containsB
4.1 Probabilities notes
- We have two actions
A
andB
- Out of all the groupings, how many had both items?
- Out of all the groupings with
A
, how many hadB
?
5 Minimums two_col
- Min Support
- lower bound on support probability
- Min Confidence
- lower bound on confidence probability
- Strong
- Rule that satisfies both minimums
5.1 "Frequently" notes
- Now we can talk about what frequently means
- It doesn't matter if two very unpopular items were purchased together: car battery and smoke detector
- Also don't care if
A
happens a lot: everybody buys milk, so not a big deal if some bought milk and strawberries - Also important to note confidence is not symmetric: buying strawberries may be frequent with buying milk, but not visa versa
6 Too Many Rules
- Patterns not limited to 2 events
- But looking for all patterns leads to combinatorial number of options
a,b,c,d,e |
a,b |
a,c |
… |
a,b,c |
a,b,e |
… |
7 Subset Patterns
- Max-Pattern
X
rule is frequent and there exists no frequent super-patternY
- Closed
X
rule is frequent and there exists no super-patternY
with the same support- Shortcut
- Find only max-pattern or closed patterns, let other patterns be subsets
7.1 Shortcut notes
- So how can we calculate all the potentially frequently occurring patterns?
- We can find either the max or closed pattern that encompasses all of the patterns we're looking for
- These are more easily tracked, and we can still derive all of the frequently occurring sub-patterns
- We can use the reverse: if a rule or item is not frequent enough alone, its
super-set will not be frequent enough:
- If
A
is does not meet min support, there's no way forA,B
to make support
- If
8 Apriori
- Find supported single event rules
- Combine to make 2-event rules, check DB for support
- Combine to make 3-event rules, check DB…
- Stop when no N-event rules
9
9.1 Speed notes
- Isn't that slow? Yes!
- Book has some techniques to speed it up, mostly around grouping
- Can group together sets and if the group does not meet the support threshold, then none of the members do
10 Interesting Patterns two_col
- Strong rules may not always be interesting rules
- Basketball => eat cereal [40%, 66.7%] is strong
- But "not cereal" has a bigger effect on if you play basketball
Basketball | Not basketball | Sum | |
Cereal | 2000 | 1750 | 3750 |
Not cereal | 1000 | 250 | 1250 |
Sum | 3000 | 2000 | 5000 |
10.1 Details notes
- Not cereal column: has a huge effect on if someone plays basketball
- cereal + basketball… sure it happens frequently, but you'd actually expect to see a bigger effect
11 Lift
P(A ∪ B) / P(A)*P(B)
- If
A
andB
independent, what is likelihood ofA
andB
?
11.1 Correlation notes
- 1
- so if lift > 1, you're seeing something that is happening more often than random
- < 1 means they negatively correlated
- X2, cosine, others in book