2013-02-01-Obtaining-Data

Table of Contents

1 Obtaining Data    slide

2 Ways to Collect    slide two_col

  • Operational Data
  • Data Warehouse
  • Unstructured Data
  • External API
  • Data Sets

img/bottlecaps.jpg

2.1 Operational Data    slide

  • Most frequent in industry
  • Usually stored in databases best suited for transactional use
  • Challenge is reorganizing data to suit question

2.1.1 Data from production    notes

  • Most frequently you'll have data that is being used by the application, and you'll want to find insights in it
  • We'll go into more detail in another class, but online use is optimized for small queries and small updates
  • Frequently just accessing the data in bulk is a software engineering problem:
    • ensuring long queries don't hold up production usage
    • joining across databases via software
    • understanding esoteric columns, like "flags"
  • Often will want to reorganize data to look like transactional
  • img: http://woodwarddesign.ca/blog/2009/03/06/bottle-caps/

2.2 Example    slide

  • Find the user names with most "liked" reviews on Yelp

users

userIDnameflags
25234Jim0x200

reviews

reviewIDbusinessIDuserIDstarstextflags
28252432252344great place!0x1

feedback

reviewIDsrcUserIDufcFlagsflags
28282050x10x0

2.2.1 Distributed Data    notes

  • At Yelp we have a variety of database tables, and those tables can be spread across different databases
  • At a minimum we frequently need to JOIN across tables to answer queries
    • eg. matching up user names with reviews from separate tables
  • It is possible the review table is only indexed on business ID, and so finding all reviews by a user is really disk intensive: make sure you're not slowing down the whole site!
  • An additional challenge is when the "feedback" tables are in a separate database: can no longer issue normal SQL queries
  • What are these "flag" columns for?
  • Exactly: no one knows. Often must look into code, or compare data to production representation to guess meaning. In Yelp, 0x1 often means "inactive", so we probably don't want to count that feedback

2.3 Data Warehouse    slide two_col

  • Data located on same system
  • Organized for analytics queries
  • Requires extra maintenance and understanding of construction

img/Ikea-Warehouse.jpg

2.3.1 No free lunch    notes

  • A strong data warehouse can be a big improvement over operational data
  • Hopefully, someone has already cleaned, joined data in a way that makes sense!
  • Optimized for long running queries: less fear of brining down website!
  • But you must learn how that process was accomplished in order to understand potential problems
  • How to handle missing data?
  • We'll go into more detail about how data warehouse schemas compare to online ones later in the course

2.4 Unstructured    slide

  • Haphazard collection of data
  • Unclear what structure should be
  • Examples: Web logs, text, multimedia
  • Must extract structure eventually

2.4.1 Yelp JSON logs    notes

  • When developing a web application, new context or details become important: how long did certain requests take? What link did a user follow to a website?
  • Relational Databases aren't well suited for these wide varieties of potential attributes that don't apply to all items
  • So the current work around is just to write all useful information down in a log, and extract what is needed later
  • Text, like business reviews, another example: desired structure changes radically between questions: How many words? Characters? What is the sentiment?
  • Pictures can contain attributes like color depth, length, width
  • First step of data mining is often imposing structure on data: the data is not inherently unstructured, it just is unclear what the structure should be until query time

2.5 Search Logs Example    slide

193.139.1 jim [10/Oct/2013:13:55:36 -0700] "GET /search?query=headache HTTP/1.1" 200 9288
282.482.3 shreyas [10/Oct/2013:13:56:36 -0700] "GET /search?query=bananas HTTP/1.1" 200 2929
345.114.1 steven [10/Oct/2013:13:56:37 -0700] "GET /search?query=cold HTTP/1.1" 200 8232
10.328.52 anne [10/Oct/2013:13:56:39 -0700] "GET /search?query=flu+shot HTTP/1.1" 200 2342
10.328.52 lily [10/Oct/2013:13:57:40 -0700] "GET /search?query=i290 HTTP/1.1" 200 2342
userNamedatequery
jim10/Oct/2013:13:55:36 -0700headache
shreyas10/Oct/2013:13:56:36 -0700bananas
steven10/Oct/2013:13:56:37 -0700cold
anne10/Oct/2013:13:56:39 -0700flu shot
lily10/Oct/2013:13:57:40 -0700i290

2.5.1 Imposing Structure    notes

  • Extract only the rows we know follow a format
  • Format queries from some encoding (eg. URL) to standardized format

2.6 External APIs    slide

  • Better documented than internal data!
  • More limited in amount and detail
  • Commonly HTTP/REST based

2.6.1 Motivation    notes

  • Companies are often searching for other ways to leverage their data
  • Both for immediate business purposes, and for brand recognition
  • Twitter more (in)famous example
  • NYTimes another good option

2.7 NYTimes API Example    slide

2.7.1 Accessing these    notes

  • More info on how to access these APIs is in the Web Architecture class, but feel free to ask Shreyas or I about how best to access them

2.8 Data Sets    slide

  • Download large, curated set of data all at once
  • Formats vary, but usually documented
  • Can be useful to combine with other datasets or APIs

img/kaggle-digits.png

2.8.1 Research    notes

  • Data sets commonly used in research: can compare different techniques on same data to understand advantages
  • Sizes can range to a few MB to GB
  • JSON, CSV, XML all potential formats. Cleaning, organization for your question again becomes an important aspect

2.9 Data Set Example    slide two_col

3 Exploring Data    slide

  • Data sets are frequently too large to fit in standard tools like Excel or Word
  • Simplest to explore on the command line
  • Homework will be exploring a data set of your choice

3.1 Size    notes

  • Some formats will not be easily parsed into Excel: eg. JSON, XML
  • Word will be slow, or unworkable for GB size data
  • CLI provides many composable tools for text manipulation

4 Yelp Academic Dataset    slide

  • Yelp Data Set covers reviews, users, businesses
  • To download, you'll need to sign up: process takes ~24 hours for approval
  • Use .edu email

4.1 Example    notes

  • We'll use this as an example, you can use any data set of your choice
  • Just for HW, don't need to use for project

5 CLI introduction    slide

  • Standard commands available in Learn CLI the hard way
  • All example will be run on ischool.berkeley.edu
  • Sheyas and I available for more help

5.1 Help    notes

  • If you're new, don't be intimidated.
  • Security policies ensure you can't break anything besides your own files
  • Keep backups of important stuff anyway

6 wget    slide

  • Used for downloading files
  • Downloading with the browser is fine, but sometimes nice to use faster connection, or download it directly to machine you're working on
$ wget 'http://www.grouplens.org/system/files/ml-100k.zip'

6.1 Command    notes

  • Just wget URL
  • I like to use quotes in case there are special characters in the URL, eg ?
  • Will download to current directory, same name as remote file

7 scp    slide

  • Copy a file to or from a remote machine
  • Uses same connection as SSH, but copies data instead
  • Example: Copy data you've downloaded in your browser
$ scp ~/Downloads/ml-100k.zip jblomo@ischool.berkeley.edu:
# or
$ scp ~/Downloads/ml-100k.zip jblomo@ischool.berkeley.edu:i290/movielens-100k.zip

7.1 Command    notes

  • Trailing : is important: signifies remote machine
  • If you don't specify path or filename, will copy the file with the same name into your home directory

8 gunzip unzip    slide

  • Uncompress data sets for simpler, faster manipulation
$ unzip ml-100k.zip
# or
$ gunzip yelp_academic_dataset.json.gz

8.1 Commands    notes

unzip
expand potentially many file, leave original alone
gunzip
expand original file, leaving only the uncompressed version

9 less    slide

  • View a file
  • History: original command was called more to see a file a page at a time
  • "Less is more"
less yelp_academic_dataset.json

10 Searching in less    slide

  • / (forward slash) lets you input search text
  • q will quit
/type": "user"
/type": "review"

10.1 Command    notes

  • Useful for finding specific instances to investigate

11 grep    slide

  • Find and print lines matching a "regular expression"
  • Regular expressions are "find" on steroids, but you can use simple strings
$ grep 'type": "review"' yelp_academic_dataset.json

12 wc    slide

  • "wordcount" counts characters, words, lines
  • Most useful in data sets for lines: -l
$ wc -l yelp_academic_dataset.json
474434 yelp_academic_dataset.json

13 Composable    slide

  • Genius of Unix: do one thing well, compose commands to get what you want
  • | pipe characters "sends" output from one program to the input of another
  • How many reviews in the data set?
$ grep 'type": "review"' yelp_academic_dataset.json | wc -l
330071
$ egrep -o 'business_id": "\w+"' yelp_academic_dataset.json  | sort -u | wc -l
9592

Date: 2013-02-01 13:45:14 PST

Author: Jim Blomo

Org version 7.8.02 with Emacs version 23

Validate XHTML 1.0