CHAPTER 2: Data Preprocessing
• Why preprocess the data?
• Data cleaning
• Data integration and transformation
• Data reduction
• Discretization and concept hierarchy generation
• Summary
Data Reduction Strategies
- Why data reduction?
  - A database/data warehouse may store terabytes of data
  - Complex data analysis/mining may take a very long time to run on the complete data set
- Data reduction
  - Obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results
- Data reduction strategies
  - Data cube aggregation
  - Dimensionality reduction (e.g., remove unimportant attributes)
  - Data compression
  - Numerosity reduction (e.g., fit data into models)
  - Discretization and concept hierarchy generation
Attribute Subset Selection
- Feature selection (i.e., attribute subset selection):
  - Select a minimum set of features such that the probability distribution of the different classes, given the values for those features, is as close as possible to the original distribution given the values of all features
  - Reduces the number of attributes appearing in the discovered patterns, making the patterns easier to understand
- Heuristic methods (due to the exponential number of choices):
  - Step-wise forward selection: The procedure starts with an empty set of attributes as the reduced set. The best of the original attributes is determined and added to the reduced set; each subsequent step adds the best of the remaining original attributes to the set (see the sketch after this list).
  - Step-wise backward elimination: The procedure starts with the full set of attributes. At each step, it removes the worst attribute remaining in the set.
  - Combining forward selection and backward elimination: A combination of the two approaches above: at each step, the procedure selects the best attribute and removes the worst from among the remaining attributes.
  - Decision-tree induction: Constructs a flowchart-like structure in which each internal (non-leaf) node denotes a test on an attribute, each branch corresponds to an outcome of the test, and each external (leaf) node denotes a class prediction. When decision-tree induction is used for attribute subset selection, a tree is constructed from the given data. All attributes that do not appear in the tree are assumed to be irrelevant; the attributes that do appear in the tree form the reduced subset of attributes.
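As a concrete illustration of the greedy procedures above, here is a minimal sketch of step-wise forward selection in Python. The use of scikit-learn, a decision-tree classifier, and cross-validated accuracy as the criterion for the "best" attribute are illustrative assumptions, not something the notes prescribe.

```python
# Minimal sketch of step-wise forward selection (assumptions: scikit-learn,
# a decision-tree classifier, cross-validated accuracy as the criterion).
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def forward_selection(X, y, n_features):
    """Greedily grow the reduced set, adding the best remaining attribute."""
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < n_features:
        # Score every candidate attribute when added to the current set.
        scores = {j: cross_val_score(DecisionTreeClassifier(random_state=0),
                                     X[:, selected + [j]], y, cv=5).mean()
                  for j in remaining}
        best = max(scores, key=scores.get)  # best of the remaining attributes
        selected.append(best)
        remaining.remove(best)
    return selected
```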
Heuristic Feature Selection Methods
- There are 2^d possible feature subsets of d features
- Several heuristic feature selection methods:
  - Best single features under the feature independence assumption: choose by significance tests (a sketch follows this list)
  - Best step-wise feature selection:
    - The best single feature is picked first
    - Then the next best feature conditioned on the first, ...
  - Step-wise feature elimination:
    - Repeatedly eliminate the worst feature
  - Best combined feature selection and elimination
  - Optimal branch and bound:
    - Use feature elimination and backtracking
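To make the significance-test approach concrete, the sketch below ranks each attribute with a univariate ANOVA F-test and keeps the top k. scikit-learn's SelectKBest and the iris data set are illustrative choices, not part of the notes.

```python
# Rank each attribute independently by a univariate significance test
# (ANOVA F-test) and keep the k highest-scoring ones.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)
selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)
print(selector.scores_)            # per-attribute test statistics
X_reduced = selector.transform(X)  # keeps the 2 best single attributes
```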
Data Compression
- String compression
  - There are extensive theories and well-tuned algorithms
  - Typically lossless
  - But only limited manipulation is possible without expansion
- Audio/video compression
  - Typically lossy compression, with progressive refinement
  - Sometimes small fragments of the signal can be reconstructed without reconstructing the whole
- Time sequence data (as distinct from audio)
  - Typically short and varies slowly with time
Dimensionality Reduction: Wavelet Transformation

In dimensionality reduction, data encoding or transformations are applied so as to obtain a reduced or "compressed" representation of the original data. If the original data can be reconstructed from the compressed data without any loss of information, the data reduction is called lossless. If, instead, we can reconstruct only an approximation of the original data, then the data reduction is called lossy. Two effective methods of lossy dimensionality reduction are:
- Principal components analysis (PCA)
- Discrete wavelet transform (DWT): linear signal processing, multi-resolutional analysis
  - Compressed approximation: store only a small fraction of the strongest of the wavelet coefficients
  - Similar to the discrete Fourier transform (DFT), but better lossy compression, localized in space
[Figure: Haar-2 and Daubechies-4 wavelet functions]
November 27, 2014 Data Mining: Concepts and 11
- Method: The general procedure for applying a discrete wavelet transform uses a hierarchical pyramid algorithm that halves the data at each iteration, resulting in fast computational speed (see the sketch below). The method is as follows:
  - The length L of the input data vector must be an integer power of 2 (padding with 0's when necessary)
  - Each transform has two functions: smoothing and difference
  - Applied to pairs of data points, they produce two sets of data of length L/2
  - The two functions are applied recursively until the desired length is reached
- Applications: compression of fingerprint images, computer vision, analysis of time-series data
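A minimal sketch of the pyramid algorithm using the Haar wavelet, where smoothing is the pairwise average and difference is the pairwise half-difference; NumPy and this particular normalization (dividing by 2 rather than the square root of 2) are simplifying assumptions of the example.

```python
# Hierarchical pyramid algorithm with the Haar wavelet: at each level the
# data is halved into a smoothed part and a set of detail coefficients.
import numpy as np

def haar_dwt(x):
    x = np.asarray(x, dtype=float)
    assert x.size > 0 and x.size & (x.size - 1) == 0, "length must be a power of 2"
    details = []
    while x.size > 1:
        smooth = (x[0::2] + x[1::2]) / 2  # smoothing: pairwise average
        detail = (x[0::2] - x[1::2]) / 2  # difference: pairwise half-difference
        details.append(detail)
        x = smooth                        # recurse on the halved data
    details.append(x)                     # overall average remains
    return details

# Keeping only the few strongest coefficients yields a compressed approximation.
print(haar_dwt([2, 2, 0, 2, 3, 5, 4, 4]))
```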
Dimensionality Reduction: Principal Component Analysis (PCA)

- Given N data vectors from n dimensions, find k ≤ n orthogonal vectors (principal components) that can best be used to represent the data
- Steps (sketched below):
  - Normalize the input data, so that each attribute falls within the same range
  - Compute k orthonormal (unit) vectors, i.e., the principal components
  - Each input data vector is a linear combination of the k principal component vectors
  - The principal components are sorted in order of decreasing "significance" or strength
  - Since the components are sorted, the size of the data can be reduced by eliminating the weak components, i.e., those with low variance (using the strongest principal components, it is possible to reconstruct a good approximation of the original data)
- Works for numeric data only
- Used when the number of dimensions is large
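The steps above fit in a few lines of NumPy. In this sketch, normalization is simplified to mean-centering, and the orthonormal components are taken as eigenvectors of the covariance matrix; the random data is purely illustrative.

```python
# PCA sketch: center, find orthonormal components, keep the k strongest,
# and express each data vector as a combination of those components.
import numpy as np

def pca_reduce(X, k):
    Xc = X - X.mean(axis=0)               # normalize (here: mean-center)
    cov = np.cov(Xc, rowvar=False)        # covariance of the attributes
    eigval, eigvec = np.linalg.eigh(cov)  # orthonormal eigenvectors
    order = np.argsort(eigval)[::-1]      # sort by decreasing "significance"
    components = eigvec[:, order[:k]]     # keep the k strongest components
    return Xc @ components                # each row as a k-dim representation

X = np.random.default_rng(0).normal(size=(100, 5))
print(pca_reduce(X, 2).shape)             # (100, 2)
```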
[Figure: Principal Component Analysis: data points in the (X, Y) plane with their principal axes]
Data Reduction Method (1): Regression and Log-Linear Models
- Linear regression: Data are modeled to fit a straight line
  - Often uses the least-squares method to fit the line
- Multiple regression: Allows a response variable Y to be modeled as a linear function of a multidimensional feature vector
- Log-linear model: Approximates discrete multidimensional probability distributions
Regression Analysis and Log-Linear Models

- Linear regression: Y = wX + b
  - Two regression coefficients, w and b, specify the line and are to be estimated by using the data at hand (see the sketch below)
  - They are fitted by applying the least-squares criterion to the known values of Y1, Y2, ..., X1, X2, ...
- Multiple regression: Y = b0 + b1 X1 + b2 X2
  - Many nonlinear functions can be transformed into the above
- Log-linear models:
  - The multi-way table of joint probabilities is approximated by a product of lower-order tables
  - Probability: p(a, b, c, d) = α_ab β_ac χ_ad δ_bcd
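As a numerosity-reduction illustration, the sketch below fits Y = wX + b by least squares; the synthetic data and NumPy's polyfit are assumptions of the example. Fifty data points are then represented by just two coefficients.

```python
# Fit Y = w X + b by least squares; the whole data set is then
# represented by the two coefficients w and b.
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=50)
Y = 3.0 * X + 1.5 + rng.normal(scale=0.5, size=50)  # true w = 3.0, b = 1.5

w, b = np.polyfit(X, Y, deg=1)  # least-squares estimates
print(w, b)                     # close to 3.0 and 1.5
```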
Data Reduction Method (3): Clustering
- Partition the data set into clusters based on similarity, and store only the cluster representations (e.g., centroid and diameter)
- Can be very effective if the data is clustered, but not if the data is "smeared"
- Clustering can be hierarchical, and the representations can be stored in multi-dimensional index tree structures
- There are many choices of clustering definitions and clustering algorithms (one choice is sketched below)
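A minimal sketch of cluster-based reduction using k-means as one of the many possible algorithm choices; scikit-learn, the synthetic data, and the approximation of the diameter as twice the maximum centroid distance are all assumptions of this example.

```python
# Replace 1,000 points with 5 cluster representations (centroid, diameter).
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(1000, 2))
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)

for c in range(5):
    members = X[km.labels_ == c]
    centroid = members.mean(axis=0)
    # Diameter approximated as twice the largest distance to the centroid.
    diameter = 2 * np.linalg.norm(members - centroid, axis=1).max()
    print(c, centroid, diameter)  # store these instead of the raw points
```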
Data Reduction Method (4): Sampling
- Sampling: obtaining a small sample s to represent the whole data set N
- Allows a mining algorithm to run in complexity that is potentially sub-linear in the size of the data
- Choose a representative subset of the data
  - Simple random sampling may have very poor performance in the presence of skew
  - Develop adaptive sampling methods
- Stratified sampling:
  - Approximate the percentage of each class (or subpopulation of interest) in the overall database
  - Used in conjunction with skewed data (contrasted with simple random sampling in the sketch below)
- Note: Sampling may not reduce database I/Os (data is read a page at a time)
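The sketch below contrasts simple random sampling with stratified sampling on a skewed class distribution; pandas and the synthetic rare/common split are assumptions of the example (GroupBy.sample needs pandas 1.1 or later).

```python
# Simple random vs. stratified sampling on skewed data: stratification
# preserves the proportion of the rare class in the sample.
import pandas as pd

df = pd.DataFrame({
    "value": range(1000),
    "cls": ["rare" if i < 50 else "common" for i in range(1000)],
})

srs = df.sample(frac=0.1, random_state=0)  # simple random sample
strat = df.groupby("cls", group_keys=False).sample(frac=0.1, random_state=0)

print(srs["cls"].value_counts())    # rare-class share may drift
print(strat["cls"].value_counts())  # exactly 5 rare, 95 common
```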