Cluster Analysis: Basic Concepts and Methods (Cheat Sheet)

A comprehensive introduction to cluster analysis, a fundamental data mining technique used to group similar data objects together. It covers basic concepts, partitioning methods, hierarchical methods, and evaluation of clustering. The document also includes examples and exercises to illustrate the concepts and methods discussed.

Cluster Analysis: Basic Concepts and Methods

- Cluster Analysis: Basic Concepts
- Partitioning Methods
- Hierarchical Methods
- Evaluation of Clustering
- Summary


What is Cluster Analysis?

- Cluster: a collection of data objects that are
  - similar (or related) to one another within the same group, and
  - dissimilar (or unrelated) to the objects in other groups
- Cluster analysis (also called clustering or data segmentation): finding similarities between data according to the characteristics found in the data, and grouping similar data objects into clusters
- Unsupervised learning: no predefined classes (learning by observation, as opposed to learning by examples in supervised learning)
- Typical applications:
  - as a stand-alone tool to gain insight into the data distribution
  - as a preprocessing step for other algorithms

Clustering as a Preprocessing Tool (Utility)

- Summarization: preprocessing for regression, PCA, classification, and association analysis
- Compression: image processing, e.g., vector quantization
- Finding k-nearest neighbors: localizing the search to one or a small number of clusters
- Outlier detection: outliers are often viewed as the points "far away" from any cluster

Quality: What Is Good Clustering?

- A good clustering method produces high-quality clusters with
  - high intra-class similarity: cohesive within clusters
  - low inter-class similarity: distinctive between clusters
- The quality of a clustering method depends on
  - the similarity measure used by the method,
  - its implementation, and
  - its ability to discover some or all of the hidden patterns
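The intra- vs. inter-class contrast can be made concrete with a small sketch. The points and the helper `avg_pairwise` are illustrative inventions, not from the slides: cohesion is measured as the average distance within a cluster, separation as the average distance between two clusters.

```python
def avg_pairwise(A, B):
    """Mean Euclidean distance over all point pairs drawn from A and B."""
    d = [sum((x - y) ** 2 for x, y in zip(p, q)) ** 0.5
         for p in A for q in B]
    return sum(d) / len(d)

tight = [(0, 0), (0, 1), (1, 0)]    # a cohesive cluster (hypothetical data)
other = [(9, 9), (9, 10), (10, 9)]  # a well-separated cluster

intra = avg_pairwise(tight, tight)  # small: high intra-class similarity
inter = avg_pairwise(tight, other)  # large: low inter-class similarity
print(intra, inter)
```

For a good clustering, `intra` comes out much smaller than `inter`.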

Considerations for Cluster Analysis

- Partitioning criteria: single-level vs. hierarchical partitioning (often, multi-level hierarchical partitioning is desirable)
- Separation of clusters: exclusive (e.g., one customer belongs to only one region) vs. non-exclusive (e.g., one document may belong to more than one class)
- Similarity measure: distance-based (e.g., Euclidean, road network, vector) vs. connectivity-based (e.g., density or contiguity)
- Clustering space: full space (often when low-dimensional) vs. subspaces (often in high-dimensional clustering)

Requirements and Challenges

- Scalability: clustering all the data instead of only samples
- Ability to deal with different types of attributes: numerical, binary, categorical, ordinal, linked, and mixtures of these
- Constraint-based clustering: the user may give input on constraints; use domain knowledge to determine input parameters
- Interpretability and usability
- Others:
  - discovery of clusters with arbitrary shape
  - ability to deal with noisy data
  - incremental clustering and insensitivity to input order
  - high dimensionality

Major Clustering Approaches (II)

- Model-based: a model is hypothesized for each cluster, and the method finds the best fit of the data to that model; typical methods: EM, SOM, COBWEB
- Frequent-pattern-based: based on the analysis of frequent patterns; typical method: p-Cluster
- User-guided or constraint-based: clustering by considering user-specified or application-specific constraints; typical methods: COD (obstacles), constrained clustering
- Link-based clustering: objects are often linked together in various ways, and these massive links can be used to cluster objects; typical methods: SimRank, LinkClus

November 11, 2024 Data Mining: Concepts and Techniques 11

Alternatives to Calculate the Distance between Clusters

- Single link: the smallest distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = min{ dist(tip, tjq) : tip in Ki, tjq in Kj }
- Complete link: the largest distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = max{ dist(tip, tjq) : tip in Ki, tjq in Kj }
- Average link: the average distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = avg{ dist(tip, tjq) : tip in Ki, tjq in Kj }
- Centroid: the distance between the centroids of the two clusters, i.e., dist(Ki, Kj) = dist(Ci, Cj)
- Medoid: the distance between the medoids of the two clusters, i.e., dist(Ki, Kj) = dist(Mi, Mj), where a medoid is one chosen, centrally located object in the cluster
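As a rough sketch, the linkage definitions above can be computed directly from the pairwise distances. The two clusters `Ki` and `Kj` are made-up 2-D examples, and Euclidean distance is assumed:

```python
from itertools import product

def euclid(a, b):
    """Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def single_link(K1, K2):
    """Smallest pairwise distance between the two clusters."""
    return min(euclid(p, q) for p, q in product(K1, K2))

def complete_link(K1, K2):
    """Largest pairwise distance between the two clusters."""
    return max(euclid(p, q) for p, q in product(K1, K2))

def average_link(K1, K2):
    """Average over all pairwise distances between the two clusters."""
    dists = [euclid(p, q) for p, q in product(K1, K2)]
    return sum(dists) / len(dists)

def centroid_dist(K1, K2):
    """Distance between the clusters' centroids (coordinate-wise means)."""
    mean = lambda K: tuple(sum(xs) / len(K) for xs in zip(*K))
    return euclid(mean(K1), mean(K2))

Ki = [(0, 0), (1, 0)]   # hypothetical cluster
Kj = [(4, 0), (6, 0)]   # hypothetical cluster
print(single_link(Ki, Kj))    # 3.0
print(complete_link(Ki, Kj))  # 6.0
print(average_link(Ki, Kj))   # 4.5
print(centroid_dist(Ki, Kj))  # 4.5
```

Note that single link and complete link use only one extreme pair, while average link and the centroid distance summarize all members, which makes them less sensitive to outliers.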

Cluster Analysis: Basic Concepts and Methods

- Cluster Analysis: Basic Concepts
- Partitioning Methods
- Hierarchical Methods
- Evaluation of Clustering
- Summary

Partitioning Algorithms: Basic Concept

- Partitioning method: partition a database D of n objects into a set of k clusters such that the sum of squared distances is minimized, where ci is the centroid or medoid of cluster Ci:

  E = sum_{i=1..k} sum_{p in Ci} (p - ci)^2

- Given k, find the partition into k clusters that optimizes the chosen partitioning criterion
  - Global optimum: exhaustively enumerate all partitions
  - Heuristic methods: the k-means and k-medoids algorithms
    - k-means: each cluster is represented by the center of the cluster
    - k-medoids or PAM (Partitioning Around Medoids): each cluster is represented by one of the objects in the cluster
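The objective E above can be sketched in a few lines of Python; the clusters and centroids here are hypothetical inputs, just to show the double sum:

```python
def sse(clusters, centroids):
    """Sum of squared Euclidean distances from each point to its
    cluster's centroid: E = sum_i sum_{p in Ci} ||p - ci||^2."""
    total = 0.0
    for C, c in zip(clusters, centroids):
        for p in C:
            total += sum((x - y) ** 2 for x, y in zip(p, c))
    return total

clusters = [[(0, 0), (2, 0)], [(5, 5), (7, 5)]]  # hypothetical partition
centroids = [(1, 0), (6, 5)]                     # the cluster means
print(sse(clusters, centroids))  # 1 + 1 + 1 + 1 = 4.0
```

k-means alternates between reassigning points and recomputing centroids, and each step can only decrease this E, which is why it converges (possibly to a local optimum).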

An Example of K-Means Clustering

1. Start from the initial data set; arbitrarily choose k objects as centroids and assign each object to its nearest centroid.
2. Compute new centroids for each cluster and update the cluster centroids.
3. Reassign objects to their nearest centroid.
4. Loop steps 2-3 until the assignments no longer change.

Example

Suppose that the following items are given, to form clusters with k = 2 using the Manhattan distance. The initial centroids are 2 and 4.


Exercise

Suppose that the data mining task is to cluster the following eight points into three clusters:

A1(2,10), A2(2,5), A3(8,4), B1(5,8), B2(7,5), B3(6,4), C1(1,2), C2(4,9)

The distance is Euclidean distance. Let A1, B1, and C1 be the initial centroids. Use the k-means algorithm to find the final three clusters.
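A minimal k-means sketch can be used to check this exercise. It assumes the standard Lloyd iteration, stops when the centroids no longer change, and does not guard against empty clusters (which cannot occur with this data):

```python
def kmeans(points, centroids, max_iter=100):
    """Plain k-means with Euclidean distance; `centroids` supplies
    the initial cluster centers."""
    centroids = list(centroids)
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            d2 = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[d2.index(min(d2))].append(p)
        # Update step: recompute each centroid as its cluster's mean.
        new = [tuple(sum(xs) / len(C) for xs in zip(*C)) for C in clusters]
        if new == centroids:  # converged: centroids (and assignments) stable
            break
        centroids = new
    return clusters, centroids

points = {"A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "B1": (5, 8),
          "B2": (7, 5), "B3": (6, 4), "C1": (1, 2), "C2": (4, 9)}
clusters, cents = kmeans(list(points.values()),
                         [points["A1"], points["B1"], points["C1"]])
for C in clusters:
    print(sorted(name for name, p in points.items() if p in C))
```

Tracing the iterations by hand gives the final clusters {A1, B1, C2}, {A3, B2, B3}, and {A2, C1}.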


Comments on the K-Means Method

- Strength: efficient: O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations; normally k, t << n (compare PAM: O(k(n-k)^2))
- Comment: often terminates at a local optimum
- Weaknesses:
  - applicable only to objects in a continuous n-dimensional space (use the k-modes method for categorical data; in comparison, k-medoids can be applied to a wider range of data)
  - need to specify k, the number of clusters, in advance (although there are ways to determine the best k automatically)
  - sensitive to noisy data and outliers
  - not suitable for discovering clusters with non-convex shapes