1 Preprocessing    slide

2 Real World is Dirty    slide

missing timestamps for actions
salary = -10
age: 42, birthday: 1997-03-07

2.1 Types of dirty    notes

Lacking some attribute values, or containing only aggregate data. E.g. we often regret tracking only total votes (aggregation) instead of recording timestamps on individual actions like UFCing
Containing errors, like impossible salary data, or decimals in the wrong place
If there are ever two fields that depend on each other, in a large dataset you'll find them disagreeing (like the age and birthday on the slide). Errors often come from failures: processes failing halfway through an update

3 Causes of Problems    slide

  • Humans
  • Software
  • Hardware

3.1 Problems    notes

  • Berkeley experiment to measure temperature across campus
  • Turned out average on campus much warmer than external weather services predicted
  • But sample data looked in line with predictions
  • Problem: one monitoring station right next to air conditioning unit!
  • Hardware failure is rare on any one machine, but with large numbers of machines it becomes probable. E.g. RAM can suffer ~1 bit error/hour/gigabyte (ECC can help)
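
The "rare, but probable at scale" point can be made concrete with a little arithmetic. A minimal sketch, assuming machines fail independently; the function name `prob_any_failure` and the 0.1% per-machine failure probability are made up for illustration, not measured values.

```python
def prob_any_failure(p_single, num_machines):
    """Probability that at least one of num_machines independent machines fails.

    Uses the complement rule: 1 - P(no machine fails).
    """
    return 1.0 - (1.0 - p_single) ** num_machines


# Hypothetical per-machine failure probability of 0.1% over some period.
for n in (1, 100, 10_000):
    print(n, "machines:", round(prob_any_failure(0.001, n), 4))
```

With these assumed numbers a single machine almost never fails, while a fleet of 10,000 almost certainly sees at least one failure.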

3.2 Inconsistent Different Sources    slide

  • Great value in combining data sources
  • Challenge is merging them together, removing duplicates
  • Example: Business names

3.3 Business names    notes

  • Starbucks vs. Starbucks Coffee Shop
  • Buck's vs Bucks
  • Trying to use address? Stackbucks vs. Starbucks across the street
  • Best strategy here is to use DM/ML techniques on the combination of features to determine likelihood of match. We'll discuss specific algorithms later in the course
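
As a rough illustration of scoring the "likelihood of match", here is a minimal sketch that uses plain string similarity on the name alone. It is a stand-in for the DM/ML techniques covered later, not one of them. The helpers `normalize` and `name_similarity` are hypothetical names; `difflib.SequenceMatcher` is from the Python standard library.

```python
import difflib
import re


def normalize(name):
    """Lowercase and strip punctuation so "Buck's" and "Bucks" compare equal."""
    return re.sub(r"[^a-z0-9 ]", "", name.lower()).strip()


def name_similarity(a, b):
    """Return a similarity ratio in [0, 1] between two business names."""
    return difflib.SequenceMatcher(None, normalize(a), normalize(b)).ratio()


pairs = [
    ("Starbucks", "Starbucks Coffee Shop"),
    ("Buck's", "Bucks"),
    ("Stackbucks", "Starbucks"),
]
for a, b in pairs:
    print(f"{a!r} vs {b!r}: {name_similarity(a, b):.2f}")
```

Note that "Stackbucks" vs "Starbucks" also scores high, which is why a single feature is not enough and the combination of features (name, address, phone, category) matters.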

4 Preprocessing    slide

Data cleaning
fill missing values, smooth noisy data, identify or remove outliers, resolve inconsistencies
Data integration
merging data from multiple sources
Data reduction
obtain a smaller data set that can sufficiently answer important questions
Data transformation
change data to a form that is easier to mine or analyze

4.1 Flu Trend Problems (Questions)    notes

  • We have millisecond search resolution, but will only be plotting on a per day basis
  • We have the exact text of each query, but just care if it is about the flu or not
  • In Flu Trends, we sometimes see out-of-control search bots doing 100,000s of searches per day
  • Mobile phone searches and web searches hit different machines, software, logs
  • We have IPs in the logs, but will be plotting against geographical areas

5 Missing Values    slide two_col



5.1 What to do?    notes

  • (Heights are made up)
  • We want to get an average class height
  • Q: What to do with missing rows?
  • ignore, fill, constant, average, average wrt gender

6 Fill Missing Values    slide animate

  • Ignore the record
  • Find value manually
  • Global constant
  • Average
  • Average with respect to class
  • "Most probable"

6.1 Details    notes

core to engineering
Ignore the record
Simply drop it from the data set. Hope there are not too many to affect the answer. Drawbacks? When the missing values are all in the same class, dropping them skews the data
Find value manually
Even for a small class, might be difficult. Get ruler, measure them. For historical data, impossible.
Global constant
replace with "N/A" or "6 foot". Can skew data, or cause data to pop in other analysis (all grouped together)
Average
Mean or median. Either one has potential problems.
Average with respect to class
gender. Average female/male height to fill in values
"Most probable"
Think of this as another step beyond avg -> class avg. Now throw in other details: age, family history, shoe size. Then weight them depending on how much those factors are correlated. Pretty soon you have a regression or Bayesian model, which we will cover later
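
The simpler fill strategies above can be sketched in a few lines. This is an illustration under stated assumptions, not a recommended pipeline: the heights are made up, `None` marks a missing value, and all function names are hypothetical.

```python
import statistics

# Made-up heights in inches; None marks a missing value.
records = [
    {"gender": "F", "height": 64},
    {"gender": "F", "height": 66},
    {"gender": "F", "height": None},
    {"gender": "M", "height": 70},
    {"gender": "M", "height": 72},
    {"gender": "M", "height": None},
]


def known_heights(rows):
    """Heights that are actually present."""
    return [r["height"] for r in rows if r["height"] is not None]


def fill_ignore(rows):
    """Ignore the record: drop rows with a missing height."""
    return known_heights(rows)


def fill_constant(rows, constant):
    """Global constant: replace every missing height with the same value."""
    return [r["height"] if r["height"] is not None else constant for r in rows]


def fill_mean(rows):
    """Average: replace missing heights with the mean of the known heights."""
    return fill_constant(rows, statistics.mean(known_heights(rows)))


def fill_class_mean(rows, key="gender"):
    """Average with respect to class: use the mean of the row's own class."""
    class_means = {}
    for value in {r[key] for r in rows}:
        members = [r for r in rows if r[key] == value]
        class_means[value] = statistics.mean(known_heights(members))
    return [
        r["height"] if r["height"] is not None else class_means[r[key]]
        for r in rows
    ]


print("ignore:    ", statistics.mean(fill_ignore(records)))
print("constant:  ", statistics.mean(fill_constant(records, 72)))
print("mean:      ", statistics.mean(fill_mean(records)))
print("class mean:", fill_class_mean(records))
```

Filling with the "6 foot" constant (72 inches) pulls the class average up, which is the skew the notes warn about.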

7 Normalization    slide

  • Type of data transformation to make reasoning and comparison easier
  • Is 6' tall?
  • Coefficients on attributes in regressions understandable

7.1 Context, Comparison    notes

  • 6' Might be tall for this class, but not on a basketball team
  • How do we know when a data point is "average" or towards the top of a range?
  • For our housing model, we wanted to use sq. footage and # of bedrooms. But the sq. footage number is huge compared to bedrooms. If we didn't normalize, a formula for determining house price might seem to indicate that # of bedrooms was way more important

8 Min-max    slide


8.1 New Range    notes

  • Typically new range is
    • [0, 1] (thought of as %)
    • [-1, 1] (thought of as bad -> good)
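
A minimal sketch of min-max normalization, assuming the usual formula v' = (v - min) / (max - min) * (new_max - new_min) + new_min. The function name `min_max` and the sample heights are made up for illustration.

```python
def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so min(values) -> new_min and max(values) -> new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("min-max normalization is undefined when all values are equal")
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


heights = [60, 66, 72]  # made-up heights in inches
print(min_max(heights))         # [0.0, 0.5, 1.0]
print(min_max(heights, -1, 1))  # [-1.0, 0.0, 1.0]
```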

9 Z-score    slide


9.1 Uses    notes

  • When you want a relative measure of deviation
  • When you have a distribution estimate, but are unsure of absolute min-max
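
A minimal sketch of z-score normalization, assuming the usual definition z = (v - mean) / standard deviation. The function name `z_scores` is hypothetical. The mean and standard deviation can be computed from the data or passed in as estimates, for example from a sample or an informed guess.

```python
import statistics


def z_scores(values, mean=None, stdev=None):
    """Express each value as a number of standard deviations from the mean.

    If mean or stdev is not supplied, it is computed from the values
    (population standard deviation).
    """
    if mean is None:
        mean = statistics.mean(values)
    if stdev is None:
        stdev = statistics.pstdev(values)
    if stdev == 0:
        raise ValueError("z-score is undefined when the standard deviation is 0")
    return [(v - mean) / stdev for v in values]


heights = [60, 66, 72]  # made-up heights in inches
print(z_scores(heights))

# Using an estimated distribution instead of the observed data:
print(z_scores([72], mean=66, stdev=3))  # [2.0]
```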

10 Comparison    slide

img/outliers.png img/outliers-minmax-zscore.png

10.1 Min-max vs Z-score    notes

  • Min-max: Known range
  • Z-score: more expressive range
  • Min-max: requires knowing min-max
  • Z-score: can estimate with sampling or informed guess

11 Removing Noise    slide

Binning
create B bins << N data samples, use an aggregate statistic of each bin as the value
Regression
fit data to a function, use the function value
Outlier analysis
find outlying points, understand and/or ignore them
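
A minimal sketch of smoothing by bins, assuming equal-depth bins (each bin holds roughly the same number of samples) and the bin mean as the aggregate statistic. The function name `smooth_by_bin_means` and the sample values are illustrative. Because this version sorts the values first, it suits unordered samples rather than a time series.

```python
def smooth_by_bin_means(values, num_bins):
    """Sort the values, split them into num_bins equal-depth bins,
    and replace each value with the mean of its bin."""
    ordered = sorted(values)
    n = len(ordered)
    smoothed = []
    for b in range(num_bins):
        start = b * n // num_bins
        end = (b + 1) * n // num_bins
        bin_values = ordered[start:end]
        if bin_values:  # skip empty bins when num_bins > n
            bin_mean = sum(bin_values) / len(bin_values)
            smoothed.extend([bin_mean] * len(bin_values))
    return smoothed


samples = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(samples, 3))
# [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```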

11.1 Monitoring Problem    notes

  • For the problem encountered in temperature monitoring, which makes the most sense?

11.2 Trade-offs    slide

Binning
Simple way to remove outliers, but difficult to pick buckets correctly
Regression
If one metric is a direct function of another, what extra information does the value provide?
Outlier analysis
Manual process of understanding outliers; ignoring them can obscure some analysis (e.g. income disparity)

11.2.1 Trade-offs again    notes

  • Remember: this class is exposing you to potential tools, up to you to be asking the right questions, selecting the appropriate algorithms, interpreting results

12 Data integration    slide

  • Merging two data sources
  • Problem: uniquely identify a concept in both sources
  • Find data points that are very "close" to each other, call them the same with some probability
  • Example: Yelp Menu Data

12.1 Yelp Menu Data    notes

  • Recently launched menu data
  • Takes data about the restaurant menu, finds reviews & pictures referring to the menu item
  • Joins them together
  • Many different metrics for "close": remember them?

13 Other measures of "close"    slide

Are A and B close?


13.1 Correlation    notes

  • Imagine A and B have several different dimensions, maybe things like length, height, width, radius
  • Are they similar?
  • On one hand no: clearly different order of magnitude
  • Another way to think about similarity is correlation
  • All of B's dimensions are 30x those of A
  • Maybe they are just using different units!
  • If I plotted A and B as x,y, what would the result look like?

14 Χ2 Correlation Test    slide

img/correlation.png img/chiequation.jpg

14.1 Motivation    notes

  • Answer: a straight line
  • So a correlation coefficient gives a sense of how closely linearly related two data sets are
  • Note: besides being positive or negative, the slope does not affect the correlation score, only how well the data fits a line
  • Also note I said linear: patterns may still be exhibited, but they are not linearly related, eg 30x
  • Details of test are in book, you are expected to understand it
  • Motivation: how different are the observed values from the expected?
  • Expected is calculated using probability with the assumptions that the sets are independent
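
A minimal sketch of the "observed vs expected" calculation, assuming the standard chi-square statistic: the sum over all cells of (observed - expected)^2 / expected, where each expected count is row total * column total / grand total (what independence would predict). The function name `chi_square` and the counts in the table are made up for illustration; see the book for the full test, including degrees of freedom and significance.

```python
def chi_square(observed):
    """Chi-square statistic for a contingency table given as a list of rows.

    Assumes every row total and column total is positive.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Count we would expect in this cell if the attributes were independent.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (obs - expected) ** 2 / expected
    return chi2


# Made-up counts: rows are one attribute's values, columns the other's.
table = [
    [250, 200],
    [50, 1000],
]
print(round(chi_square(table), 2))

# A table whose cells match the independent expectation exactly scores 0.
print(chi_square([[10, 20], [20, 40]]))
```

The larger the statistic, the further the observed counts are from what independence predicts.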

15 Covariance & Correlation    slide

  • Correlation is "normalized" covariance
  • Covariance describes the degree to which two data sets track each other, in units of the two data sets
  • Correlation describes the degree of similarity without units
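
A minimal sketch of "correlation is normalized covariance", assuming population covariance and the Pearson correlation coefficient. The function names and the values of A are made up; B is 30x A, as in the earlier A and B example.

```python
import math


def covariance(xs, ys):
    """Population covariance: mean product of deviations from the means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


def correlation(xs, ys):
    """Pearson correlation: covariance divided by both standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))


a = [1.0, 2.0, 3.0, 4.0]   # hypothetical length, height, width, radius
b = [30 * x for x in a]    # same object measured in different units

print(covariance(a, b))    # depends on the units of a and b
print(correlation(a, b))   # unitless, close to 1 for a perfect linear relation
```

Rescaling B changes the covariance but leaves the correlation unchanged.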

15.1 Use in industry    notes

  • Χ2 used most commonly, handy to have an expected [0-1] range
  • "Correlation does not imply causation"
  • A->B, B->A, C->A,B, A->B->A…, coincidence

16 Data Reduction    slide two_col

remove attributes that are the same or similar to other attributes
represent or aggregate the data, sometimes with precision loss
generalized techniques to decrease the number of bytes needed to store data


16.1 Deep Dive    notes

  • We're only going to cover selected topics in these areas.
  • When reading, make sure to understand the intuition behind the other techniques, but if we don't cover it in lecture, you won't need to calculate it on the midterm
  • Ask questions about the concepts you don't understand! That's what separates this class from a book :)
  • But still potentially useful for your projects!
17 Subset Selection    slide

  • Too many attributes?
  • Ignore some
  • Tricky part: which to ignore?
  • height x width = area

17.1 Simple to Sophisticated    notes

  • Ignore the ones that are not helpful
  • Ignore an attribute highly correlated with another (cm, in)
  • Ignore an attribute that can be built from others
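
A minimal sketch of the "highly correlated" rule, assuming a greedy pass that keeps an attribute only if its absolute Pearson correlation with every attribute kept so far is below a threshold. The names `pearson` and `select_attributes`, the 0.95 threshold, and the data are all made up for illustration. This only catches linear redundancy (cm vs in); it would not detect that area can be built from height x width.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient. Assumes neither column is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def select_attributes(columns, threshold=0.95):
    """Keep attributes, in order, that are not highly correlated with one already kept."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


columns = {
    "height_cm": [150.0, 160.0, 170.0, 180.0],
    "height_in": [59.1, 63.0, 66.9, 70.9],
    "shoe_size": [9.0, 7.0, 10.0, 8.0],
}
print(select_attributes(columns))  # ['height_cm', 'shoe_size']
```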

18 Principal Component Analysis    slide


  • Map data to a location along a few vectors

18.1 Higher dimensions    notes

  • Remember, this might not make much sense in 2 dimensions, but it becomes useful with a higher number of dimensions
  • These points described by two attributes, <x,y>
  • What if we wanted to describe them in just 1 dimension?
  • Pick some good vectors (in our case 1)
  • Describe where a point is located using only those vectors
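
A minimal sketch of the 2-dimensions-to-1 case, assuming the "good vector" is the direction of greatest variance (the first principal component). For 2-D data this direction has a closed form from the covariance matrix, so no linear algebra library is needed. The function name `pca_1d` and the points are made up for illustration; real use in higher dimensions would rely on an eigendecomposition or SVD routine.

```python
import math


def pca_1d(points):
    """Project 2-D points onto their first principal component.

    Returns (direction, coords): a unit vector, and each point's
    position along that vector after centering on the mean.
    """
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]

    # Entries of the 2x2 covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum(x * x for x, _ in centered) / n
    syy = sum(y * y for _, y in centered) / n
    sxy = sum(x * y for x, y in centered) / n

    # Angle of the leading eigenvector of a symmetric 2x2 matrix.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    direction = (math.cos(theta), math.sin(theta))

    coords = [x * direction[0] + y * direction[1] for x, y in centered]
    return direction, coords


points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8)]
direction, coords = pca_1d(points)
print("vector:", direction)
print("1-D positions:", coords)
```

Each point is now described by a single number: how far along the chosen vector it sits.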

19 Netflix and PCA    slide

  • A user may have many preferences: Mission Impossible, Love Actually, Man from Nowhere, …
  • Instead of keeping track of every preference, we can summarize
  • Action, RomCom, Foreign

19.1 Summarize in discovered dimensions    notes

  • With 3 or more "categories", we can reconstruct the user's likely preferences
  • Dimensions don't necessarily fit into human notions: there probably is not a "foreign" dimension, but rather a subtle combination of other aspects

