Data Mining - Clustering Methods, Study notes of Data Mining

Detailed information about Cluster Analysis, Clustering High-Dimensional Data, Types of Data in Cluster Analysis, Partitioning Methods, Hierarchical Methods, and Density-Based Methods.

Chapter 7. Cluster Analysis
1. What is Cluster Analysis?
2. Types of Data in Cluster Analysis
3. A Categorization of Major Clustering Methods
4. Partitioning Methods
5. Hierarchical Methods
6. Density-Based Methods
7. Grid-Based Methods
8. Model-Based Methods
9. Clustering High-Dimensional Data
10. Constraint-Based Clustering
11. Outlier Analysis
12. Summary


Major Clustering Approaches (I)

  • Partitioning approach:
    • Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors
    • Typical methods: k-means, k-medoids, CLARANS
  • Hierarchical approach:
    • Create a hierarchical decomposition of the set of data (or objects) using some criterion
    • Typical methods: DIANA, AGNES, BIRCH, ROCK, CHAMELEON
  • Density-based approach:
    • Based on connectivity and density functions
    • Typical methods: DBSCAN, OPTICS, DenClue (a small code comparison follows this list)
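
As a rough illustration of the three approaches above, the sketch below runs one representative method from each family on toy data using scikit-learn (assumed to be installed); the data and parameter values are arbitrary and not from the slides.

    import numpy as np
    from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN

    X = np.random.rand(200, 2)                                          # toy 2-D data

    part_labels = KMeans(n_clusters=3, n_init=10).fit_predict(X)        # partitioning approach
    hier_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)  # hierarchical approach
    dens_labels = DBSCAN(eps=0.1, min_samples=5).fit_predict(X)         # density-based approach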


Typical Alternatives to Calculate the Distance between Clusters

  • Single link: smallest distance between an element in one cluster and an element in the other, i.e., dis(Ki, Kj) = min(dist(tip, tjq)) over all tip in Ki, tjq in Kj
  • Complete link: largest distance between an element in one cluster and an element in the other, i.e., dis(Ki, Kj) = max(dist(tip, tjq)) over all tip in Ki, tjq in Kj
  • Average: average distance between an element in one cluster and an element in the other, i.e., dis(Ki, Kj) = avg(dist(tip, tjq)) over all tip in Ki, tjq in Kj
  • Centroid: distance between the centroids of two clusters, i.e., dis(Ki, Kj) = dis(Ci, Cj)
  • Medoid: distance between the medoids of two clusters, i.e., dis(Ki, Kj) = dis(Mi, Mj)
    • Medoid: one chosen, centrally located object in the cluster
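
A small sketch of these linkage measures for two clusters of 1-D points is given below; the function names are illustrative, and dist() stands in for any distance metric.

    from itertools import product
    from statistics import mean

    def dist(a, b):
        return abs(a - b)                  # placeholder metric for 1-D points

    def single_link(Ki, Kj):
        return min(dist(p, q) for p, q in product(Ki, Kj))

    def complete_link(Ki, Kj):
        return max(dist(p, q) for p, q in product(Ki, Kj))

    def average_link(Ki, Kj):
        return mean(dist(p, q) for p, q in product(Ki, Kj))

    def centroid_link(Ki, Kj):
        return dist(mean(Ki), mean(Kj))    # distance between cluster centroids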


Centroid, Radius and Diameter of a Cluster (for numerical data sets)

  • Centroid: the "middle" of a cluster: $C_m = \frac{\sum_{i=1}^{N} t_{ip}}{N}$
  • Radius: square root of the average distance from any point of the cluster to its centroid: $R_m = \sqrt{\frac{\sum_{i=1}^{N} (t_{ip} - c_m)^2}{N}}$
  • Diameter: square root of the average mean squared distance between all pairs of points in the cluster: $D_m = \sqrt{\frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (t_{ip} - t_{jq})^2}{N(N-1)}}$
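
These formulas translate directly into code; the following is a minimal sketch for a 1-D cluster (function names are illustrative).

    from math import sqrt

    def centroid(cluster):
        return sum(cluster) / len(cluster)

    def radius(cluster):
        c = centroid(cluster)
        return sqrt(sum((t - c) ** 2 for t in cluster) / len(cluster))

    def diameter(cluster):
        n = len(cluster)
        # Sum over all ordered pairs; identical pairs contribute 0.
        return sqrt(sum((p - q) ** 2 for p in cluster for q in cluster) / (n * (n - 1)))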


Partitioning Algorithms: Basic Concept

  • Partitioning method: construct a partition of a database D of n objects into a set of k clusters such that the sum of squared distances is minimized: $E = \sum_{m=1}^{k} \sum_{t_{mi} \in K_m} (C_m - t_{mi})^2$
  • Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion
    • Global optimal: exhaustively enumerate all partitions
    • Heuristic methods: k-means and k-medoids algorithms
    • k-means (MacQueen'67): each cluster is represented by the center of the cluster
    • k-medoids or PAM (Partition Around Medoids) (Kaufman & Rousseeuw'87): each cluster is represented by one of the objects in the cluster
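
A minimal sketch of the squared-error criterion E above, for 1-D clusters (names are illustrative):

    def squared_error(clusters, centres):
        # clusters: list of lists of points; centres: one centre per cluster.
        # e.g. squared_error([[2, 3], [10, 12]], [2.5, 11]) == 2.5
        return sum((t - c) ** 2
                   for cluster, c in zip(clusters, centres)
                   for t in cluster)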


The K-Means Clustering Method

  • Given k, the k-means algorithm is implemented in four steps (a minimal sketch follows this list):
    1. Partition objects into k nonempty subsets
    2. Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., mean point, of the cluster)
    3. Assign each object to the cluster with the nearest seed point
    4. Go back to Step 2; stop when there are no more new assignments
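
The sketch below follows the four steps literally for 1-D numeric data; the function name and the random seeding choice are assumptions, not from the slides.

    import random

    def kmeans(points, k, max_iters=100):
        # Step 1: start from k arbitrary objects as the initial seed points.
        centroids = random.sample(list(points), k)
        for _ in range(max_iters):
            # Step 3: assign each object to the cluster with the nearest seed point.
            clusters = [[] for _ in range(k)]
            for p in points:
                nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
                clusters[nearest].append(p)
            # Step 2: recompute seed points as the mean of each (nonempty) cluster.
            new_centroids = [sum(c) / len(c) if c else centroids[i]
                             for i, c in enumerate(clusters)]
            # Step 4: stop when the means (and hence the assignment) no longer change.
            if new_centroids == centroids:
                break
            centroids = new_centroids
        return centroids, clusters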


Comments on the K-Means Method

  • Strength: relatively efficient: O(tkn), where n is the number of objects, k is the number of clusters, and t is the number of iterations; normally, k, t << n
    • Comparing: PAM: O(k(n-k)^2), CLARA: O(ks^2 + k(n-k))
  • Comment: often terminates at a local optimum; the global optimum may be found using techniques such as deterministic annealing and genetic algorithms
  • Weakness
    • Applicable only when the mean is defined; then what about categorical data?
    • Need to specify k, the number of clusters, in advance
    • Unable to handle noisy data and outliers
    • Not suitable for discovering clusters with non-convex shapes


Variations of the K-Means Method

  • A few variants of k-means differ in
    • Selection of the initial k means
    • Dissimilarity calculations
    • Strategies to calculate cluster means
  • Handling categorical data: k-modes (Huang'98) (see the sketch after this list)
    • Replacing means of clusters with modes
    • Using new dissimilarity measures to deal with categorical objects
    • Using a frequency-based method to update modes of clusters
    • A mixture of categorical and numerical data: k-prototype method
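
A tiny sketch of the two k-modes ingredients mentioned above, simple-matching dissimilarity and a frequency-based mode update, assuming categorical objects are represented as tuples of equal length (names are illustrative):

    from collections import Counter

    def matching_dissim(x, y):
        # Number of attributes on which the two categorical objects disagree.
        return sum(a != b for a, b in zip(x, y))

    def cluster_mode(objects):
        # Per-attribute most frequent category across the cluster's objects,
        # e.g. cluster_mode([("red", "S"), ("red", "M"), ("blue", "M")]) == ("red", "M")
        return tuple(Counter(col).most_common(1)[0][0] for col in zip(*objects))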


K-Means Algorithm

(The original slide presents the algorithm as pseudocode; the steps are those listed under "The K-Means Clustering Method" above.)


K-Means Example
  • Given: {2, 4, 10, 12, 3, 20, 30, 11, 25}, k = 2
  • Randomly assign means: m1 = 3, m2 = 4
  • K1 = {2, 3}, K2 = {4, 10, 12, 20, 30, 11, 25}, m1 = 2.5, m2 = 16
  • K1 = {2, 3, 4}, K2 = {10, 12, 20, 30, 11, 25}, m1 = 3, m2 = 18
  • K1 = {2, 3, 4, 10}, K2 = {12, 20, 30, 11, 25}, m1 = 4.75, m2 = 19.6
  • K1 = {2, 3, 4, 10, 11, 12}, K2 = {20, 30, 25}, m1 = 7, m2 = 25
  • Stop, as the clusters with these means are the same.
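
The iterations above can be reproduced with a few lines of Python, assuming the initial means m1 = 3 and m2 = 4 from the worked example; everything else is recomputed.

    data = [2, 4, 10, 12, 3, 20, 30, 11, 25]
    m1, m2 = 3.0, 4.0                      # initial means from the example
    while True:
        k1 = [x for x in data if abs(x - m1) <= abs(x - m2)]
        k2 = [x for x in data if abs(x - m1) > abs(x - m2)]
        new_m1, new_m2 = sum(k1) / len(k1), sum(k2) / len(k2)
        print(k1, k2, new_m1, new_m2)      # matches the iterations listed above
        if (new_m1, new_m2) == (m1, m2):   # means unchanged: clustering has converged
            break
        m1, m2 = new_m1, new_m2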


A Typical K-Medoids Algorithm (PAM)

  • Arbitrarily choose k objects as the initial medoids
  • Assign each remaining object to the nearest medoid (total cost = 20 in the slide's example)
  • Randomly select a non-medoid object Orandom and compute the total cost of swapping (total cost = 26 in the slide's example)
  • Swap a medoid with Orandom if quality is improved
  • Repeat the loop until no change


PAM (Partitioning Around Medoids) (1987)

  • PAM (Kaufman and Rousseeuw, 1987), built into S-Plus
  • Uses real objects to represent the clusters:
    1. Select k representative objects arbitrarily
    2. For each pair of a non-selected object h and a selected object i, calculate the total swapping cost TCih
    3. For each pair of i and h:
      • If TCih < 0, i is replaced by h
      • Then assign each non-selected object to the most similar representative object
    4. Repeat steps 2-3 until there is no change (a PAM-style sketch follows this list)
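
The following is a compact PAM-style sketch of the steps above for 1-D points: it tries swapping each selected medoid i with each non-selected object h, computes the change in total cost TCih, and accepts swaps that reduce it (function names and the initial-medoid choice are assumptions).

    def total_cost(points, medoids):
        # Sum of each object's distance to its nearest medoid.
        return sum(min(abs(p - m) for m in medoids) for p in points)

    def pam(points, k):
        medoids = list(points[:k])                    # arbitrary initial medoids
        improved = True
        while improved:
            improved = False
            for i in range(k):
                for h in points:
                    if h in medoids:
                        continue
                    candidate = medoids[:i] + [h] + medoids[i + 1:]
                    tc_ih = total_cost(points, candidate) - total_cost(points, medoids)
                    if tc_ih < 0:                     # the swap i -> h improves quality
                        medoids = candidate
                        improved = True
        return medoids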


What Is the Problem with PAM?

  • PAM is more robust than k-means in the presence of noise and outliers, because a medoid is less influenced by outliers or other extreme values than a mean
  • PAM works efficiently for small data sets but does not scale well to large data sets
    • O(k(n-k)^2) for each iteration, where n is the number of data points and k is the number of clusters
  • Sampling-based method: CLARA (Clustering LARge Applications)


CLARA (Clustering Large Applications) (1990)

  • CLARA (Kaufman and Rousseeuw, 1990)
    • Built into statistical analysis packages, such as S-Plus
  • It draws multiple samples of the data set, applies PAM on each sample, and gives the best clustering as the output (a CLARA-style sketch follows this list)
  • Strength: deals with larger data sets than PAM
  • Weakness:
    • Efficiency depends on the sample size
    • A good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased
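
Following the description above, a CLARA-style wrapper runs the medoid search only on samples and keeps whichever medoid set is cheapest on the full data set. This sketch reuses the pam() and total_cost() helpers from the PAM sketch earlier; the sample size and number of samples are arbitrary assumptions.

    import random

    def clara(points, k, n_samples=5, sample_size=40):
        best_medoids, best_cost = None, float("inf")
        for _ in range(n_samples):
            sample = random.sample(points, min(sample_size, len(points)))
            medoids = pam(sample, k)                  # PAM on the sample only
            cost = total_cost(points, medoids)        # evaluate on the whole data set
            if cost < best_cost:
                best_medoids, best_cost = medoids, cost
        return best_medoids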