Prepare for your exams
Get points
Guidelines and tips

Sell on Docsity

Prepare for your exams

Study with the several resources on Docsity

Earn points to download

Earn points by helping other students or get them with a premium plan

Guidelines and tips

Sell on Docsity

Prepare for your exams

Study with the several resources on Docsity

Prepare for your exams with the study notes shared by other students like you on Docsity

Search Store documents

The best documents sold by students who completed their studies

Search through all study resources

Summarize your documents, ask them questions, convert them into quizzes and concept maps

Explore questions

Clear up your doubts by reading the answers to questions asked by your fellow students

Earn points to download

Earn points by helping other students or get them with a premium plan

Share documents

For each uploaded document

Answer questions

For each given answer (max 1 per day)

All the ways to get free points

Get points immediately

Choose a premium plan with all the points you need

Study Opportunities

Choose your next study program

Get in touch with the best universities in the world. Search through thousands of universities and official partners

Community

Ask the community

Ask the community for help and clear up your study doubts

University Rankings

Discover the best universities in your country according to Docsity users

Free resources

Our save-the-student-ebooks!

Download our free guides on studying techniques, anxiety management strategies, and thesis advice from Docsity tutors

From our blog

Exams and Study

Data Warehousing: Understanding Data Integration, Star Schemas, and OLAP Queries, Exams of Data Mining

Dr. Babasaheb Ambedkar Marathwada University Data Mining

An overview of data warehousing, focusing on data integration, basic olap queries, and star schemas. Topics include data cubes, slicing and dicing, molap vs. Rolap, and sql olap extensions. The document also discusses etl processes, data integration challenges, and federated databases.

What you will learn

What are the advantages and disadvantages of data warehouses versus federated databases?
What are the key components of a data warehouse?
What is ETL and what are its three main steps?

What is a star schema and how is it used in data warehousing?
What is data warehousing and why is it important?

Typology: Exams

2017/2018

Uploaded on 09/12/2018

abhay_swati 🇮🇳

1 document

1 / 35

This page cannot be seen from the preview

Don't miss anything!

bg1

Data Warehousing

pf3

pf4

pf5

pf8

pf9

pfa

pfd

pfe

pff

pf12

pf13

pf14

pf15

pf16

pf17

pf18

pf19

pf1a

pf1b

pf1c

pf1d

pf1e

pf1f

pf20

pf21

pf22

pf23

Related documents

OLAP: Online Analytical Processing and Data Warehousing

Data Integration: Mediators, Warehousing, and Answering Queries

Basic Concepts of Data Warehousing and OLTP/OLAP Systems

Understanding Data Warehousing and OLAP with Documentary Evidence from Docsity.com

Dimensional Queries in Data Warehousing and Data Mining

Operational Data Stores and Data Warehouses - Prof. dfdf

Data Warehousing-Database Management Systems-Lecture 26 Slides-Computer Science

Data Warehousing: Online Transaction Processing vs. Online Analytic Processing

(1)

Data Warehousing Fundamentals and Concepts

Data Warehouse Schema and OLAP Operations

Data Warehousing and OLAP - Outline | CS 4440

Lecture Notes on OLAP and Data Warehousing

Partial preview of the text

Download Data Warehousing: Understanding Data Integration, Star Schemas, and OLAP Queries and more Exams Data Mining in PDF only on Docsity!

Data Warehousing

Outline

• Data integration

• Basic OLAP queries

– Data cubes

– Slice and dice, drill down, roll up

– MOLAP vs. ROLAP

– SQL OLAP Extensions: ROLLUP, CUBE

• Star Schemas

– Facts and Dimensions

Terminology: ETL

• ETL = Extraction, Transformation, & Load

• Extraction: Get the data out of the source

systems

• Transformation: Convert the data into a useful

format for analysis

• Load: Get the data into the data warehouse

(…and build indexes, materialized views, etc.)

Data Integration is Hard

Data warehouses combine data from multiple sources
(^) Data must be translated into a consistent format
(^) Data integration represents ~80% of effort for a typical

data warehouse project!

(^) Some reasons why it’s hard:
- (^) Metadata is often poor or non-existent
- (^) Data quality is often bad
  - Missing or default values
  - Multiple spellings of the same thing (Cal vs. UC Berkeley vs. University of California)
- (^) Inconsistent semantics
  - What is an airline passenger?

Warehouses vs. Federation

(^) Advantages of federated databases:
- (^) No redundant copying of data
- (^) Queries see “real-time” view of evolving data
- (^) More flexible security policy
(^) Disadvantages of federated databases:
- (^) Analysis queries place extra load on transactional systems
- (^) Query optimization is hard to do well
- (^) Historical data may not be available
- (^) Complex “wrappers” needed to mediate between analysis server and source systems
(^) Data warehouses are much more common in practice
- (^) Better performance
- (^) Lower complexity
- (^) Slightly out-of-date data is acceptable

Two Approaches to Data Warehousing

Data mart: like a data warehouse, but smaller and more

focused

(^) Top-down approach
- (^) First build single unified data warehouse with all enterprise data
- Then create data marts containing specialized subsets of the data from the warehouse
(^) Bottom-up approach
- First build a data mart to solve the most pressing problem
- (^) Then build another data mart, then another
- (^) Data warehouse = union of all data marts
(^) In practice, not much difference between the two
(^) Our book advocates the bottom-up approach

Slicing and Dicing

Jul Aug Sep

CA

OR

WA

Red Blue Gray Red Blue Gray Jul Aug Sep

CA

OR

WA

Blue Jul Aug Sep

CA

OR

WA

Blue Jul Aug Sep Total

Querying the Data Cube

(^) Cross-tabulation
- (^) “Cross-tab” for short
- (^) Report data grouped by 2 dimensions
- (^) Aggregate across other dimensions
- (^) Include subtotals
(^) Operations on a cross-tab
- (^) Roll up (further aggregation)
- (^) Drill down (less aggregation) CA OR WA Total Jul 45 33 30 108 Aug 50 36 42 128 Sep 38 31 40 109 Total 133 100 112 345 Number of Autos Sold

“Standard” Data Cube Query

• Measurements

(^) Which fact(s) should be reported?

• Filters

(^) What slice(s) of the cube should be used?

• Grouping attributes

(^) How finely should the cube be diced?
(^) Each dimension is either:
- (^) (a) A grouping attribute
- (^) (b) Aggregated over (“Rolled up” into a single total)
(^) n dimensions → 2 n^ sets of grouping attributes
(^) Aggregation = projection to a lower-dimensional

subspace

Full Data Cube with Subtotals

• Pre-computation of aggregates → fast answers

to OLAP queries

• Ideally, pre-compute all 2n^ types of subtotals

• Otherwise, perform aggregation as needed

• Coarser-grained totals can be computed from

finer-grained totals

(^) But not the other way around

MOLAP vs. ROLAP

• MOLAP = Multidimensional OLAP

• Store data cube as multidimensional array

• (Usually) pre-compute all aggregates

• Advantages:

– Very efficient data access → fast answers

• Disadvantages:

– Doesn’t scale to large numbers of dimensions

– Requires special-purpose data store

Sparsity

Imagine a data warehouse for Safeway.
(^) Suppose dimensions are: Customer, Product, Store, Day
(^) If there are 100,000 customers, 10,000 products, 1,

stores, and 1,000 days…

(^) …data cube has 1,000,000,000,000,000 cells!
(^) Fortunately, most cells are empty.
(^) A given store doesn’t sell every product on every day.
(^) A given customer has never visited most of the stores.
(^) A given customer has never purchased most products.
Multi-dimensional arrays are not an efficient way to store

sparse data.

Creating a Cross-tab with SQL

SELECT state, month, SUM(quantity)

FROM sales

GROUP BY state, month

WHERE color = 'Red'

Grouping

Attributes

Measurements

Filters

What about the totals?

(^) SQL aggregation query

with GROUP BY does not

produce subtotals, totals

(^) Our cross-tab report is

incomplete.

CA OR WA Total Jul 45 33 30? Aug 50 36 42? Sep 38 31 40? Total???? Number of Autos Sold State Month SUM CA Jul 45 CA Aug 50 CA Sep 38 OR Jul 33 OR Aug 36 OR Sep 31 WA Jul 30 WA Aug 42 WA Sep 40