CHAPTER 2: Data Preprocessing
• Why preprocess the data?
• Data cleaning
• Data integration and transformation
• Data reduction
• Discretization and concept hierarchy generation
• Summary
Data Reduction Strategies
- Why data reduction?
  - A database/data warehouse may store terabytes of data
  - Complex data analysis/mining may take a very long time to run on the complete data set
- Data reduction
  - Obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results
- Data reduction strategies
  - Data cube aggregation
  - Dimensionality reduction (e.g., remove unimportant attributes)
  - Data compression
  - Numerosity reduction (e.g., fit data into models)
  - Discretization and concept hierarchy generation
Attribute Subset Selection
- Feature selection (i.e., attribute subset selection):
  - Select a minimum set of features such that the probability distribution of the different classes, given the values for those features, is as close as possible to the original distribution given the values of all features
  - Reduces the number of attributes appearing in the discovered patterns, making the patterns easier to understand
- Heuristic methods (due to the exponential number of choices):
  - Step-wise forward selection: The procedure starts with an empty set of attributes as the reduced set. The best of the original attributes is determined and added to the reduced set; each subsequent step adds the best of the remaining original attributes to the set (see the sketch after this list).
  - Step-wise backward elimination: The procedure starts with the full set of attributes. At each step, it removes the worst attribute remaining in the set.
  - Combining forward selection and backward elimination: A combination of the two approaches above: at each step, the procedure selects the best attribute and removes the worst from among the remaining attributes.
  - Decision-tree induction: Constructs a flowchart-like structure in which each internal (non-leaf) node denotes a test on an attribute, each branch corresponds to an outcome of the test, and each external (leaf) node denotes a class prediction. When decision-tree induction is used for attribute subset selection, a tree is constructed from the given data. All attributes that do not appear in the tree are assumed to be irrelevant; the attributes that do appear in the tree form the reduced subset of attributes.
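As a concrete illustration of the greedy procedures above, here is a minimal sketch of step-wise forward selection in Python. The use of scikit-learn, a decision-tree classifier, and cross-validated accuracy as the criterion for the "best" attribute are illustrative assumptions, not something the notes prescribe.

```python
# Minimal sketch of step-wise forward selection (assumptions: scikit-learn,
# a decision-tree classifier, cross-validated accuracy as the criterion).
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def forward_selection(X, y, n_features):
    """Greedily grow the reduced set, adding the best remaining attribute."""
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < n_features:
        # Score every candidate attribute when added to the current set.
        scores = {j: cross_val_score(DecisionTreeClassifier(random_state=0),
                                     X[:, selected + [j]], y, cv=5).mean()
                  for j in remaining}
        best = max(scores, key=scores.get)  # best of the remaining attributes
        selected.append(best)
        remaining.remove(best)
    return selected
```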
Heuristic Feature Selection Methods
- There are 2^d possible feature subsets of d features
- Several heuristic feature selection methods:
  - Best single features under the feature independence assumption: choose by significance tests (a sketch follows this list)
  - Best step-wise feature selection:
    - The best single feature is picked first
    - Then the next best feature conditioned on the first, ...
  - Step-wise feature elimination:
    - Repeatedly eliminate the worst feature
  - Best combined feature selection and elimination
  - Optimal branch and bound:
    - Use feature elimination and backtracking
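To make the significance-test approach concrete, the sketch below ranks each attribute with a univariate ANOVA F-test and keeps the top k. scikit-learn's SelectKBest and the iris data set are illustrative choices, not part of the notes.

```python
# Rank each attribute independently by a univariate significance test
# (ANOVA F-test) and keep the k highest-scoring ones.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)
selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)
print(selector.scores_)            # per-attribute test statistics
X_reduced = selector.transform(X)  # keeps the 2 best single attributes
```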
Data Compression
- String compression
  - There are extensive theories and well-tuned algorithms
  - Typically lossless
  - But only limited manipulation is possible without expansion
- Audio/video compression
  - Typically lossy compression, with progressive refinement
  - Sometimes small fragments of the signal can be reconstructed without reconstructing the whole
- Time sequence data (as distinct from audio)
  - Typically short and varies slowly with time
Dimensionality Reduction: Wavelet Transformation

In dimensionality reduction, data encoding or transformations are applied so as to obtain a reduced or "compressed" representation of the original data. If the original data can be reconstructed from the compressed data without any loss of information, the data reduction is called lossless. If, instead, we can reconstruct only an approximation of the original data, then the data reduction is called lossy. Two effective methods of lossy dimensionality reduction are:
- Principal components analysis (PCA)
- Discrete wavelet transform (DWT): linear signal processing, multi-resolutional analysis
  - Compressed approximation: store only a small fraction of the strongest of the wavelet coefficients
  - Similar to the discrete Fourier transform (DFT), but better lossy compression, localized in space
[Figure: Haar-2 and Daubechies-4 wavelet functions]
November 27, 2014 Data Mining: Concepts and 11
- Method: The general procedure for applying a discrete wavelet transform uses a hierarchical pyramid algorithm that halves the data at each iteration, resulting in fast computational speed (see the sketch below). The method is as follows:
  - The length L of the input data vector must be an integer power of 2 (padding with 0's when necessary)
  - Each transform has two functions: smoothing and difference
  - Applied to pairs of data points, they produce two sets of data of length L/2
  - The two functions are applied recursively until the desired length is reached
- Applications: compression of fingerprint images, computer vision, analysis of time-series data
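A minimal sketch of the pyramid algorithm using the Haar wavelet, where smoothing is the pairwise average and difference is the pairwise half-difference; NumPy and this particular normalization (dividing by 2 rather than the square root of 2) are simplifying assumptions of the example.

```python
# Hierarchical pyramid algorithm with the Haar wavelet: at each level the
# data is halved into a smoothed part and a set of detail coefficients.
import numpy as np

def haar_dwt(x):
    x = np.asarray(x, dtype=float)
    assert x.size > 0 and x.size & (x.size - 1) == 0, "length must be a power of 2"
    details = []
    while x.size > 1:
        smooth = (x[0::2] + x[1::2]) / 2  # smoothing: pairwise average
        detail = (x[0::2] - x[1::2]) / 2  # difference: pairwise half-difference
        details.append(detail)
        x = smooth                        # recurse on the halved data
    details.append(x)                     # overall average remains
    return details

# Keeping only the few strongest coefficients yields a compressed approximation.
print(haar_dwt([2, 2, 0, 2, 3, 5, 4, 4]))
```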
Dimensionality Reduction: Principal Component Analysis (PCA)

- Given N data vectors from n dimensions, find k ≤ n orthogonal vectors (principal components) that can best be used to represent the data
- Steps (sketched below):
  - Normalize the input data, so that each attribute falls within the same range
  - Compute k orthonormal (unit) vectors, i.e., the principal components
  - Each input data vector is a linear combination of the k principal component vectors
  - The principal components are sorted in order of decreasing "significance" or strength
  - Since the components are sorted, the size of the data can be reduced by eliminating the weak components, i.e., those with low variance (using the strongest principal components, it is possible to reconstruct a good approximation of the original data)
- Works for numeric data only
- Used when the number of dimensions is large
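The steps above fit in a few lines of NumPy. In this sketch, normalization is simplified to mean-centering, and the orthonormal components are taken as eigenvectors of the covariance matrix; the random data is purely illustrative.

```python
# PCA sketch: center, find orthonormal components, keep the k strongest,
# and express each data vector as a combination of those components.
import numpy as np

def pca_reduce(X, k):
    Xc = X - X.mean(axis=0)               # normalize (here: mean-center)
    cov = np.cov(Xc, rowvar=False)        # covariance of the attributes
    eigval, eigvec = np.linalg.eigh(cov)  # orthonormal eigenvectors
    order = np.argsort(eigval)[::-1]      # sort by decreasing "significance"
    components = eigvec[:, order[:k]]     # keep the k strongest components
    return Xc @ components                # each row as a k-dim representation

X = np.random.default_rng(0).normal(size=(100, 5))
print(pca_reduce(X, 2).shape)             # (100, 2)
```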
[Figure: Principal Component Analysis: data points in the (X, Y) plane with their principal axes]
Data Reduction Method (1): Regression and Log-Linear Models
- Linear regression: Data are modeled to fit a straight line
  - Often uses the least-squares method to fit the line
- Multiple regression: Allows a response variable Y to be modeled as a linear function of a multidimensional feature vector
- Log-linear model: Approximates discrete multidimensional probability distributions
Regression Analysis and Log-Linear Models

- Linear regression: Y = wX + b
  - Two regression coefficients, w and b, specify the line and are to be estimated by using the data at hand (see the sketch below)
  - They are fitted by applying the least-squares criterion to the known values of Y1, Y2, ..., X1, X2, ...
- Multiple regression: Y = b0 + b1 X1 + b2 X2
  - Many nonlinear functions can be transformed into the above
- Log-linear models:
  - The multi-way table of joint probabilities is approximated by a product of lower-order tables
  - Probability: p(a, b, c, d) = α_ab β_ac χ_ad δ_bcd
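As a numerosity-reduction illustration, the sketch below fits Y = wX + b by least squares; the synthetic data and NumPy's polyfit are assumptions of the example. Fifty data points are then represented by just two coefficients.

```python
# Fit Y = w X + b by least squares; the whole data set is then
# represented by the two coefficients w and b.
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=50)
Y = 3.0 * X + 1.5 + rng.normal(scale=0.5, size=50)  # true w = 3.0, b = 1.5

w, b = np.polyfit(X, Y, deg=1)  # least-squares estimates
print(w, b)                     # close to 3.0 and 1.5
```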
Data Reduction Method (3): Clustering
- Partition the data set into clusters based on similarity, and store only the cluster representations (e.g., centroid and diameter)
- Can be very effective if the data is clustered, but not if the data is "smeared"
- Clustering can be hierarchical, and the representations can be stored in multi-dimensional index tree structures
- There are many choices of clustering definitions and clustering algorithms (one choice is sketched below)
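A minimal sketch of cluster-based reduction using k-means as one of the many possible algorithm choices; scikit-learn, the synthetic data, and the approximation of the diameter as twice the maximum centroid distance are all assumptions of this example.

```python
# Replace 1,000 points with 5 cluster representations (centroid, diameter).
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(1000, 2))
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)

for c in range(5):
    members = X[km.labels_ == c]
    centroid = members.mean(axis=0)
    # Diameter approximated as twice the largest distance to the centroid.
    diameter = 2 * np.linalg.norm(members - centroid, axis=1).max()
    print(c, centroid, diameter)  # store these instead of the raw points
```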
Data Reduction Method (4): Sampling
- Sampling: obtaining a small sample s to represent the whole data set N
- Allows a mining algorithm to run in complexity that is potentially sub-linear in the size of the data
- Choose a representative subset of the data
  - Simple random sampling may have very poor performance in the presence of skew
  - Develop adaptive sampling methods
- Stratified sampling:
  - Approximate the percentage of each class (or subpopulation of interest) in the overall database
  - Used in conjunction with skewed data (contrasted with simple random sampling in the sketch below)
- Note: Sampling may not reduce database I/Os (data is read a page at a time)
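The sketch below contrasts simple random sampling with stratified sampling on a skewed class distribution; pandas and the synthetic rare/common split are assumptions of the example (GroupBy.sample needs pandas 1.1 or later).

```python
# Simple random vs. stratified sampling on skewed data: stratification
# preserves the proportion of the rare class in the sample.
import pandas as pd

df = pd.DataFrame({
    "value": range(1000),
    "cls": ["rare" if i < 50 else "common" for i in range(1000)],
})

srs = df.sample(frac=0.1, random_state=0)  # simple random sample
strat = df.groupby("cls", group_keys=False).sample(frac=0.1, random_state=0)

print(srs["cls"].value_counts())    # rare-class share may drift
print(strat["cls"].value_counts())  # exactly 5 rare, 95 common
```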