2013-02-15-Data-Warehouse
Table of Contents
- 1 Data Warehouse
- 2 Database Types
- 3 Properties
- 4 OLAP or OLTP?
- 5 Overview
- 6 Types of Data Warehouses
- 7 Metadata
- 8 Overview
- 9 Datacube
- 10 Dimensions
- 11 Schemas
- 12 Data Warehouse Operations
- 13 Materialized Views
- 14 Architecture
- 15 Break
1 Data Warehouse
2 Database Types
- Data Warehouse
- Database designed for using data to make decisions
- OLAP
- OnLine Analytical Processing
- OLTP
- OnLine Transactional Processing
2.1 Data Mining notes
- These databases are often the starting point for data mining in companies
- Most of the data sets from companies typical come from exporting some portion of their data warehouse
3 Properties
- Subject Oriented
- Focus on core business objects
- Integrated
- Access to as much data as possible
- Time Variant
- Contains historical data with time parameter
- Non-volatile
- Updated (relatively) infrequently, in bulk
3.1 Examples notes
- Yelp users can be directed to a datacenter depending on conditions. This data probably doesn't need to be in the DW
- Yelp has severl databases: log summaries, user info, salesforce. Most useful if they are all in the same place
- Operationally, when someone changes their address, we just overwrite it in the OLTP DB. But DW potentially cares about the old value
- OLTP writes to rows every time someone updates profile, review, etc. Lots of simultaneous updates. DW: typically once a day, in bulk
4 OLAP or OLTP? animate
- Transactional Focus vs. Analytic Focus
- Used by Managers, Executives vs. DBAs, programmers
- Contains current information vs. Historical
- Variety of differently summarized data vs normalized
- Short transactions vs. Long queries
- Full table scans vs. Indexes on for fast lookups
- Simultaneous queries: 100s-1000s vs 1-100
- Simple updates vs Complex queries
- Guarenteed high performance vs Flexibility & Customization
5 Overview
5.1 From the front notes
- Analytics team uses charts, reports, et
- Generated from an OLAP server
- Which uses data from a data warehouse (often DW and OLAP server are integrated)
- Uses a process (ETL) to move the data from other source into DW
6 Types of Data Warehouses
- Enterprise
- turnkey solution, often expensive, sophisticated but complex ingestion, integration, security features
- Data Mart
- Smaller, limited in scope. Designed for specific team or department
- Virtual
- OLAP built on top of an OLTP database
- Cloud
- Google BigQuery, Amazon RedShift
6.1 Vendors notes
- Enterprise
- Oracle, Greenplum, AsterData
- Data Mart
- MySQL, PostgreSQL
- Virtual
- MySQL, PostgreSQL views or admin interface
7 Metadata
- Data about the data being stored
- Overview: schema, languages
- Operational: last update, query latency
- Algorithms: normalization, transformation
- Performance: job dependencies
- Business: ownership, permissions
7.1 Considerations notes
- As soon as several people start using the DW, they'll need to know about how it is put together
- Metadata often comes as an after thought but is an important part of scaling
8 Overview
8.1 Data Cubes notes
- Why are there cubes in the OLAP area?
9 Datacube two_col
- Way of thinking about multi dimensional data
- Useful metaphor because one can reason about ways to satisfy a query
10 Dimensions
Day 1 | Day 2 | Day 3 | |
Region 1 | $200 | $80 | $600 |
Region 2 | $300 | $90 | $650 |
Region 2 | $400 | $100 | $700 |
10.1 Data… Square notes
- More of a data square: only 2 dimensions
- Advertising on Yelp
- Now we want to know Product TYpe of things sold (CPC, CPM, National)
10.2 Cube: 3rd Dimension
10.2.1 More notes
- Now we want to know Page Type (Business, Search, Home)
- Hard to draw 4 dimensions, so instead…
10.3 Multi-Cube
10.3.1 More notes
- Keep adding dimension as necessary
10.4 Lattice
10.4.1 Moving notes
- Move back and forth from our 2d table
- To our 3d cube, to our 4d multi-cube
- The lower dimensions summarize table
- At the extreme is just the total (ie all money made)
11 Schemas two_col
- Data cube a way of visualizing multi dimensional data
- Star schema is a way store the data in a database
11.1 Fact table
11.2 Dimension table
11.3 Dimension tables
11.4 Dimension tables
11.5 Dimension tables
11.6 Star Schema
11.7 Dimensions of Dimensions
11.8 Dimensions of Dimensions
11.9 Dimensions of Dimensions
11.10 Dimensions of Dimensions
11.10.1 Schema Name? notes
- Any guesses what this fractal looking schema is called?
11.11 Snowflake Schema
- Schema with radiating dimension tables
11.12 Constellation Schema
- Schema with several fact tables and related dimensions
12 Data Warehouse Operations
- Rollup
- Summarize data along fewer dimensions
- Drill-down
- Get details within a particular dimension
- Slice
- Select a particular value in a dimension
- Dice
- Consider a subset of the values in a dimension
- Pivot
- Swap, or rotate dimensions
12.1 Examples notes
- Rollup
- What countries are selling the most ads?
- Drill-down
- Spike in Q1 ad views. Which month most responsible?
- Slice
- Chart sales only for CPC
- Dice
- Only look at sales in US, IT, DE
- Pivot
- Swap axis on a chart
13 Materialized Views
- View
- virtual table defined by a query
- Full
- Pre-compute and store
- None
- Calculate summaries on the fly
- Partial
- Variety of strategies: eg. cache results after calculating
13.1 Usefulness notes
- In DW, often storing different cubes in the lattice
- For the country sample, do we have those summaries stored in another DB table? On disk? By month? Year?
- Storing all possible summarize expensive when loading data, and requires a lot more storage
14 Architecture
- ROLAP
- Relational. Implement OLAP on top of a relational database
- MOLAP
- Multidimensional. Implements data cube as storage paradigm
- HOLAP
- Hybrid. Data in ROLAP, rollups in MOLAP
- Specialized
- Often distributed storage, parallel DB technology
- NoSQL
- Store data as key-value pairs, optimized in different ways
14.1 Details notes
- ROLAP
- MySQL, PostgreSQL
- MOLAP
- Oracle, Palo
- HOLAP
- MS SQL
- ???
- Specialized: AsterData, Greenplumb
- ???
- NoSQL: Hive, BigTable, Cassandra