Cluster Analysis: Basic Concepts and Methods (Cheat Sheet)

A comprehensive introduction to cluster analysis, a fundamental data mining technique used to group similar data objects together. It covers basic concepts, partitioning methods, hierarchical methods, and evaluation of clustering. The document also includes examples and exercises to illustrate the concepts and methods discussed.

Cluster Analysis: Basic Concepts and Methods

- Cluster Analysis: Basic Concepts
- Partitioning Methods
- Hierarchical Methods
- Evaluation of Clustering
- Summary


What is Cluster Analysis?

- Cluster: a collection of data objects that are
  - similar (or related) to one another within the same group, and
  - dissimilar (or unrelated) to the objects in other groups
- Cluster analysis (also called clustering or data segmentation): finding similarities between data according to the characteristics found in the data, and grouping similar data objects into clusters
- Unsupervised learning: no predefined classes (learning by observation, as opposed to learning by examples in supervised learning)
- Typical applications:
  - as a stand-alone tool to gain insight into the data distribution
  - as a preprocessing step for other algorithms

Clustering as a Preprocessing Tool (Utility)

- Summarization: preprocessing for regression, PCA, classification, and association analysis
- Compression: image processing, e.g., vector quantization
- Finding k-nearest neighbors: localizing the search to one or a small number of clusters
- Outlier detection: outliers are often viewed as the points "far away" from any cluster

Quality: What Is Good Clustering?

- A good clustering method produces high-quality clusters with
  - high intra-class similarity: cohesive within clusters
  - low inter-class similarity: distinctive between clusters
- The quality of a clustering method depends on
  - the similarity measure used by the method,
  - its implementation, and
  - its ability to discover some or all of the hidden patterns
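The intra- vs. inter-class contrast can be made concrete with a small sketch. The points and the helper `avg_pairwise` are illustrative inventions, not from the slides: cohesion is measured as the average distance within a cluster, separation as the average distance between two clusters.

```python
def avg_pairwise(A, B):
    """Mean Euclidean distance over all point pairs drawn from A and B."""
    d = [sum((x - y) ** 2 for x, y in zip(p, q)) ** 0.5
         for p in A for q in B]
    return sum(d) / len(d)

tight = [(0, 0), (0, 1), (1, 0)]    # a cohesive cluster (hypothetical data)
other = [(9, 9), (9, 10), (10, 9)]  # a well-separated cluster

intra = avg_pairwise(tight, tight)  # small: high intra-class similarity
inter = avg_pairwise(tight, other)  # large: low inter-class similarity
print(intra, inter)
```

For a good clustering, `intra` comes out much smaller than `inter`.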

Considerations for Cluster Analysis

- Partitioning criteria: single-level vs. hierarchical partitioning (often, multi-level hierarchical partitioning is desirable)
- Separation of clusters: exclusive (e.g., one customer belongs to only one region) vs. non-exclusive (e.g., one document may belong to more than one class)
- Similarity measure: distance-based (e.g., Euclidean, road network, vector) vs. connectivity-based (e.g., density or contiguity)
- Clustering space: full space (often when low-dimensional) vs. subspaces (often in high-dimensional clustering)

Requirements and Challenges

- Scalability: clustering all the data instead of only samples
- Ability to deal with different types of attributes: numerical, binary, categorical, ordinal, linked, and mixtures of these
- Constraint-based clustering: the user may give input on constraints; use domain knowledge to determine input parameters
- Interpretability and usability
- Others:
  - discovery of clusters with arbitrary shape
  - ability to deal with noisy data
  - incremental clustering and insensitivity to input order
  - high dimensionality

Major Clustering Approaches (II)

- Model-based: a model is hypothesized for each cluster, and the method finds the best fit of the data to that model; typical methods: EM, SOM, COBWEB
- Frequent-pattern-based: based on the analysis of frequent patterns; typical method: p-Cluster
- User-guided or constraint-based: clustering by considering user-specified or application-specific constraints; typical methods: COD (obstacles), constrained clustering
- Link-based clustering: objects are often linked together in various ways, and these massive links can be used to cluster objects; typical methods: SimRank, LinkClus

November 11, 2024 Data Mining: Concepts and Techniques 11

Alternatives to Calculate the Distance between Clusters

- Single link: the smallest distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = min{ dist(tip, tjq) : tip in Ki, tjq in Kj }
- Complete link: the largest distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = max{ dist(tip, tjq) : tip in Ki, tjq in Kj }
- Average link: the average distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = avg{ dist(tip, tjq) : tip in Ki, tjq in Kj }
- Centroid: the distance between the centroids of the two clusters, i.e., dist(Ki, Kj) = dist(Ci, Cj)
- Medoid: the distance between the medoids of the two clusters, i.e., dist(Ki, Kj) = dist(Mi, Mj), where a medoid is one chosen, centrally located object in the cluster
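As a rough sketch, the linkage definitions above can be computed directly from the pairwise distances. The two clusters `Ki` and `Kj` are made-up 2-D examples, and Euclidean distance is assumed:

```python
from itertools import product

def euclid(a, b):
    """Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def single_link(K1, K2):
    """Smallest pairwise distance between the two clusters."""
    return min(euclid(p, q) for p, q in product(K1, K2))

def complete_link(K1, K2):
    """Largest pairwise distance between the two clusters."""
    return max(euclid(p, q) for p, q in product(K1, K2))

def average_link(K1, K2):
    """Average over all pairwise distances between the two clusters."""
    dists = [euclid(p, q) for p, q in product(K1, K2)]
    return sum(dists) / len(dists)

def centroid_dist(K1, K2):
    """Distance between the clusters' centroids (coordinate-wise means)."""
    mean = lambda K: tuple(sum(xs) / len(K) for xs in zip(*K))
    return euclid(mean(K1), mean(K2))

Ki = [(0, 0), (1, 0)]   # hypothetical cluster
Kj = [(4, 0), (6, 0)]   # hypothetical cluster
print(single_link(Ki, Kj))    # 3.0
print(complete_link(Ki, Kj))  # 6.0
print(average_link(Ki, Kj))   # 4.5
print(centroid_dist(Ki, Kj))  # 4.5
```

Note that single link and complete link use only one extreme pair, while average link and the centroid distance summarize all members, which makes them less sensitive to outliers.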

Cluster Analysis: Basic Concepts and Methods

- Cluster Analysis: Basic Concepts
- Partitioning Methods
- Hierarchical Methods
- Evaluation of Clustering
- Summary

Partitioning Algorithms: Basic Concept

- Partitioning method: partition a database D of n objects into a set of k clusters such that the sum of squared distances is minimized, where ci is the centroid or medoid of cluster Ci:

  E = sum_{i=1..k} sum_{p in Ci} (p - ci)^2

- Given k, find the partition into k clusters that optimizes the chosen partitioning criterion
  - Global optimum: exhaustively enumerate all partitions
  - Heuristic methods: the k-means and k-medoids algorithms
    - k-means: each cluster is represented by the center of the cluster
    - k-medoids or PAM (Partitioning Around Medoids): each cluster is represented by one of the objects in the cluster
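The objective E above can be sketched in a few lines of Python; the clusters and centroids here are hypothetical inputs, just to show the double sum:

```python
def sse(clusters, centroids):
    """Sum of squared Euclidean distances from each point to its
    cluster's centroid: E = sum_i sum_{p in Ci} ||p - ci||^2."""
    total = 0.0
    for C, c in zip(clusters, centroids):
        for p in C:
            total += sum((x - y) ** 2 for x, y in zip(p, c))
    return total

clusters = [[(0, 0), (2, 0)], [(5, 5), (7, 5)]]  # hypothetical partition
centroids = [(1, 0), (6, 5)]                     # the cluster means
print(sse(clusters, centroids))  # 1 + 1 + 1 + 1 = 4.0
```

k-means alternates between reassigning points and recomputing centroids, and each step can only decrease this E, which is why it converges (possibly to a local optimum).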

An Example of K-Means Clustering

1. Start from the initial data set; arbitrarily choose k objects as centroids and assign each object to its nearest centroid.
2. Compute new centroids for each cluster and update the cluster centroids.
3. Reassign objects to their nearest centroid.
4. Loop steps 2-3 until the assignments no longer change.

Example

Suppose that the following items are given, to form clusters with k = 2 using the Manhattan distance. The initial centroids are 2 and 4.


Exercise

Suppose that the data mining task is to cluster the following eight points into three clusters:

A1(2,10), A2(2,5), A3(8,4), B1(5,8), B2(7,5), B3(6,4), C1(1,2), C2(4,9)

The distance is Euclidean distance. Let A1, B1, and C1 be the initial centroids. Use the k-means algorithm to find the final three clusters.
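A minimal k-means sketch can be used to check this exercise. It assumes the standard Lloyd iteration, stops when the centroids no longer change, and does not guard against empty clusters (which cannot occur with this data):

```python
def kmeans(points, centroids, max_iter=100):
    """Plain k-means with Euclidean distance; `centroids` supplies
    the initial cluster centers."""
    centroids = list(centroids)
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            d2 = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[d2.index(min(d2))].append(p)
        # Update step: recompute each centroid as its cluster's mean.
        new = [tuple(sum(xs) / len(C) for xs in zip(*C)) for C in clusters]
        if new == centroids:  # converged: centroids (and assignments) stable
            break
        centroids = new
    return clusters, centroids

points = {"A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "B1": (5, 8),
          "B2": (7, 5), "B3": (6, 4), "C1": (1, 2), "C2": (4, 9)}
clusters, cents = kmeans(list(points.values()),
                         [points["A1"], points["B1"], points["C1"]])
for C in clusters:
    print(sorted(name for name, p in points.items() if p in C))
```

Tracing the iterations by hand gives the final clusters {A1, B1, C2}, {A3, B2, B3}, and {A2, C1}.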


Comments on the K-Means Method

- Strength: efficient: O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations; normally k, t << n (compare PAM: O(k(n-k)^2))
- Comment: often terminates at a local optimum
- Weaknesses:
  - applicable only to objects in a continuous n-dimensional space (use the k-modes method for categorical data; in comparison, k-medoids can be applied to a wider range of data)
  - need to specify k, the number of clusters, in advance (although there are ways to determine the best k automatically)
  - sensitive to noisy data and outliers
  - not suitable for discovering clusters with non-convex shapes