2013-02-15-Data-Warehouse

Table of Contents

1 Data Warehouse    slide

2 Database Types    slide

Data Warehouse
Database designed for using data to make decisions
OLAP
OnLine Analytical Processing
OLTP
OnLine Transactional Processing

2.1 Data Mining    notes

  • These databases are often the starting point for data mining in companies
  • Most of the data sets from companies typical come from exporting some portion of their data warehouse

3 Properties    slide

Subject Oriented
Focus on core business objects
Integrated
Access to as much data as possible
Time Variant
Contains historical data with time parameter
Non-volatile
Updated (relatively) infrequently, in bulk

3.1 Examples    notes

  • Yelp users can be directed to a datacenter depending on conditions. This data probably doesn't need to be in the DW
  • Yelp has severl databases: log summaries, user info, salesforce. Most useful if they are all in the same place
  • Operationally, when someone changes their address, we just overwrite it in the OLTP DB. But DW potentially cares about the old value
  • OLTP writes to rows every time someone updates profile, review, etc. Lots of simultaneous updates. DW: typically once a day, in bulk

4 OLAP or OLTP?    slide animate

  • Transactional Focus vs. Analytic Focus
  • Used by Managers, Executives vs. DBAs, programmers
  • Contains current information vs. Historical
  • Variety of differently summarized data vs normalized
  • Short transactions vs. Long queries
  • Full table scans vs. Indexes on for fast lookups
  • Simultaneous queries: 100s-1000s vs 1-100
  • Simple updates vs Complex queries
  • Guarenteed high performance vs Flexibility & Customization

5 Overview    slide

img/olap-overview.png

5.1 From the front    notes

  • Analytics team uses charts, reports, et
  • Generated from an OLAP server
  • Which uses data from a data warehouse (often DW and OLAP server are integrated)
  • Uses a process (ETL) to move the data from other source into DW

6 Types of Data Warehouses    slide

Enterprise
turnkey solution, often expensive, sophisticated but complex ingestion, integration, security features
Data Mart
Smaller, limited in scope. Designed for specific team or department
Virtual
OLAP built on top of an OLTP database
Cloud
Google BigQuery, Amazon RedShift

6.1 Vendors    notes

Enterprise
Oracle, Greenplum, AsterData
Data Mart
MySQL, PostgreSQL
Virtual
MySQL, PostgreSQL views or admin interface

7 Metadata    slide

  • Data about the data being stored
  • Overview: schema, languages
  • Operational: last update, query latency
  • Algorithms: normalization, transformation
  • Performance: job dependencies
  • Business: ownership, permissions

7.1 Considerations    notes

  • As soon as several people start using the DW, they'll need to know about how it is put together
  • Metadata often comes as an after thought but is an important part of scaling

8 Overview    slide

img/olap-overview.png

8.1 Data Cubes    notes

  • Why are there cubes in the OLAP area?

9 Datacube    slide two_col

  • Way of thinking about multi dimensional data
  • Useful metaphor because one can reason about ways to satisfy a query

img/BorgFirstContact.jpg

10 Dimensions    slide

Day 1Day 2Day 3
Region 1$200$80$600
Region 2$300$90$650
Region 2$400$100$700

10.1 Data… Square    notes

  • More of a data square: only 2 dimensions
  • Advertising on Yelp
  • Now we want to know Product TYpe of things sold (CPC, CPM, National)

10.2 Cube: 3rd Dimension    slide

img/cube-3d.gif

10.2.1 More    notes

  • Now we want to know Page Type (Business, Search, Home)
  • Hard to draw 4 dimensions, so instead…

10.3 Multi-Cube    slide

img/cube-4d.png

10.3.1 More    notes

  • Keep adding dimension as necessary

10.4 Lattice    slide

img/cube-lattice.jpg

10.4.1 Moving    notes

  • Move back and forth from our 2d table
  • To our 3d cube, to our 4d multi-cube
  • The lower dimensions summarize table
  • At the extreme is just the total (ie all money made)

11 Schemas    slide two_col

  • Data cube a way of visualizing multi dimensional data
  • Star schema is a way store the data in a database

img/sun.jpg

11.1 Fact table    slide

img/star-1.png

11.2 Dimension table    slide

img/star-2.png

11.3 Dimension tables    slide

img/star-3.png

11.4 Dimension tables    slide

img/star-4.png

11.5 Dimension tables    slide

img/star-5.png

11.6 Star Schema    slide

img/star-schema.jpg

11.7 Dimensions of Dimensions    slide

img/star-6.png

11.8 Dimensions of Dimensions    slide

img/star-7.png

11.9 Dimensions of Dimensions    slide

img/star-8.png

11.10 Dimensions of Dimensions    slide

img/star-9.png

11.10.1 Schema Name?    notes

  • Any guesses what this fractal looking schema is called?

11.11 Snowflake Schema    slide

  • Schema with radiating dimension tables

img/star-snowflake.jpg

11.12 Constellation Schema    slide

  • Schema with several fact tables and related dimensions

img/star-constilation.jpg

12 Data Warehouse Operations    slide

Rollup
Summarize data along fewer dimensions
Drill-down
Get details within a particular dimension
Slice
Select a particular value in a dimension
Dice
Consider a subset of the values in a dimension
Pivot
Swap, or rotate dimensions

12.1 Examples    notes

Rollup
What countries are selling the most ads?
Drill-down
Spike in Q1 ad views. Which month most responsible?
Slice
Chart sales only for CPC
Dice
Only look at sales in US, IT, DE
Pivot
Swap axis on a chart

13 Materialized Views    slide

View
virtual table defined by a query
Full
Pre-compute and store
None
Calculate summaries on the fly
Partial
Variety of strategies: eg. cache results after calculating

13.1 Usefulness    notes

  • In DW, often storing different cubes in the lattice
  • For the country sample, do we have those summaries stored in another DB table? On disk? By month? Year?
  • Storing all possible summarize expensive when loading data, and requires a lot more storage

14 Architecture    slide

ROLAP
Relational. Implement OLAP on top of a relational database
MOLAP
Multidimensional. Implements data cube as storage paradigm
HOLAP
Hybrid. Data in ROLAP, rollups in MOLAP
Specialized
Often distributed storage, parallel DB technology
NoSQL
Store data as key-value pairs, optimized in different ways

14.1 Details    notes

ROLAP
MySQL, PostgreSQL
MOLAP
Oracle, Palo
HOLAP
MS SQL
???
Specialized: AsterData, Greenplumb
???
NoSQL: Hive, BigTable, Cassandra

15 Break    slide

Date: 2013-02-15 07:59:48 PST

Author: Jim Blomo

Org version 7.8.02 with Emacs version 23

Validate XHTML 1.0