Docsity
Docsity

Prepare for your exams
Prepare for your exams

Study with the several resources on Docsity


Earn points to download
Earn points to download

Earn points by helping other students or get them with a premium plan


Guidelines and tips
Guidelines and tips

Data Warehousing: Understanding Data Integration, Star Schemas, and OLAP Queries, Exams of Data Mining

An overview of data warehousing, focusing on data integration, basic olap queries, and star schemas. Topics include data cubes, slicing and dicing, molap vs. Rolap, and sql olap extensions. The document also discusses etl processes, data integration challenges, and federated databases.

What you will learn

  • What are the advantages and disadvantages of data warehouses versus federated databases?
  • What are the key components of a data warehouse?
  • What is ETL and what are its three main steps?
  • What is a star schema and how is it used in data warehousing?
  • What is data warehousing and why is it important?

Typology: Exams

2017/2018

Uploaded on 09/12/2018

abhay_swati
abhay_swati 🇮🇳

1 document

1 / 35

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
Data Warehousing
pf3
pf4
pf5
pf8
pf9
pfa
pfd
pfe
pff
pf12
pf13
pf14
pf15
pf16
pf17
pf18
pf19
pf1a
pf1b
pf1c
pf1d
pf1e
pf1f
pf20
pf21
pf22
pf23

Partial preview of the text

Download Data Warehousing: Understanding Data Integration, Star Schemas, and OLAP Queries and more Exams Data Mining in PDF only on Docsity!

Data Warehousing

Outline

• Data integration

• Basic OLAP queries

– Data cubes

– Slice and dice, drill down, roll up

– MOLAP vs. ROLAP

– SQL OLAP Extensions: ROLLUP, CUBE

• Star Schemas

– Facts and Dimensions

Terminology: ETL

• ETL = Extraction, Transformation, & Load

• Extraction: Get the data out of the source

systems

• Transformation: Convert the data into a useful

format for analysis

• Load: Get the data into the data warehouse

(…and build indexes, materialized views, etc.)

Data Integration is Hard

  • Data warehouses combine data from multiple sources
  • (^) Data must be translated into a consistent format
  • (^) Data integration represents ~80% of effort for a typical

data warehouse project!

  • (^) Some reasons why it’s hard:
    • (^) Metadata is often poor or non-existent
    • (^) Data quality is often bad
      • Missing or default values
      • Multiple spellings of the same thing (Cal vs. UC Berkeley vs. University of California)
    • (^) Inconsistent semantics
      • What is an airline passenger?

Warehouses vs. Federation

  • (^) Advantages of federated databases:
    • (^) No redundant copying of data
    • (^) Queries see “real-time” view of evolving data
    • (^) More flexible security policy
  • (^) Disadvantages of federated databases:
    • (^) Analysis queries place extra load on transactional systems
    • (^) Query optimization is hard to do well
    • (^) Historical data may not be available
    • (^) Complex “wrappers” needed to mediate between analysis server and source systems
  • (^) Data warehouses are much more common in practice
    • (^) Better performance
    • (^) Lower complexity
    • (^) Slightly out-of-date data is acceptable

Two Approaches to Data Warehousing

  • Data mart: like a data warehouse, but smaller and more

focused

  • (^) Top-down approach
    • (^) First build single unified data warehouse with all enterprise data
    • Then create data marts containing specialized subsets of the data from the warehouse
  • (^) Bottom-up approach
    • First build a data mart to solve the most pressing problem
    • (^) Then build another data mart, then another
    • (^) Data warehouse = union of all data marts
  • (^) In practice, not much difference between the two
  • (^) Our book advocates the bottom-up approach

Slicing and Dicing

Jul Aug Sep

CA

OR

WA

Red Blue Gray Red Blue Gray Jul Aug Sep

CA

OR

WA

Blue Jul Aug Sep

CA

OR

WA

Blue Jul Aug Sep Total

Querying the Data Cube

  • (^) Cross-tabulation
    • (^) “Cross-tab” for short
    • (^) Report data grouped by 2 dimensions
    • (^) Aggregate across other dimensions
    • (^) Include subtotals
  • (^) Operations on a cross-tab
    • (^) Roll up (further aggregation)
    • (^) Drill down (less aggregation) CA OR WA Total Jul 45 33 30 108 Aug 50 36 42 128 Sep 38 31 40 109 Total 133 100 112 345 Number of Autos Sold

“Standard” Data Cube Query

• Measurements

  • (^) Which fact(s) should be reported?

• Filters

  • (^) What slice(s) of the cube should be used?

• Grouping attributes

  • (^) How finely should the cube be diced?
  • (^) Each dimension is either:
    • (^) (a) A grouping attribute
    • (^) (b) Aggregated over (“Rolled up” into a single total)
  • (^) n dimensions → 2 n^ sets of grouping attributes
  • (^) Aggregation = projection to a lower-dimensional

subspace

Full Data Cube with Subtotals

• Pre-computation of aggregates → fast answers

to OLAP queries

• Ideally, pre-compute all 2n^ types of subtotals

• Otherwise, perform aggregation as needed

• Coarser-grained totals can be computed from

finer-grained totals

  • (^) But not the other way around

MOLAP vs. ROLAP

• MOLAP = Multidimensional OLAP

• Store data cube as multidimensional array

• (Usually) pre-compute all aggregates

• Advantages:

– Very efficient data access → fast answers

• Disadvantages:

– Doesn’t scale to large numbers of dimensions

– Requires special-purpose data store

Sparsity

  • Imagine a data warehouse for Safeway.
  • (^) Suppose dimensions are: Customer, Product, Store, Day
  • (^) If there are 100,000 customers, 10,000 products, 1,

stores, and 1,000 days…

  • (^) …data cube has 1,000,000,000,000,000 cells!
  • (^) Fortunately, most cells are empty.
  • (^) A given store doesn’t sell every product on every day.
  • (^) A given customer has never visited most of the stores.
  • (^) A given customer has never purchased most products.
  • Multi-dimensional arrays are not an efficient way to store

sparse data.

Creating a Cross-tab with SQL

SELECT state, month, SUM(quantity)

FROM sales

GROUP BY state, month

WHERE color = 'Red'

Grouping

Attributes

Measurements

Filters

What about the totals?

  • (^) SQL aggregation query

with GROUP BY does not

produce subtotals, totals

  • (^) Our cross-tab report is

incomplete.

CA OR WA Total Jul 45 33 30? Aug 50 36 42? Sep 38 31 40? Total???? Number of Autos Sold State Month SUM CA Jul 45 CA Aug 50 CA Sep 38 OR Jul 33 OR Aug 36 OR Sep 31 WA Jul 30 WA Aug 42 WA Sep 40