Partitioning and Hierarchical Methods for Clustering, Summaries of Data Warehousing

This document summarizes the k-means clustering algorithm, an unsupervised machine learning algorithm that groups similar objects together to form clusters. The algorithm works by minimizing the sum of squared distances between data points and their corresponding cluster centroid. The document also highlights some issues with k-means, such as its sensitivity to outliers and its restriction to continuous data.

Typology: Summaries

2021/2022

Available from 09/24/2023

jitendra-kumar-28

Partitioning and Hierarchical Methods for Clustering

● Cluster Analysis is the process of grouping a set of similar objects to form clusters.
● A group of data points comes together to form a cluster in which all the objects belong to the same group.
● This is done so that records within a cluster (intra-cluster) have high similarity with one another but high dissimilarity to records in other clusters (inter-cluster).
● It is an unsupervised machine learning-based technique that acts on unlabelled data.

Clustering

Partitioning Methods

● Partitioning methods construct random partitions and then iteratively refine them by some criterion.
● Partitioning methods are also called centroid-based clustering: each cluster is represented by a central vector, which is not necessarily a member of the data set.
● The optimization problem (iteratively refining the centroid selection) is known to be NP-hard, so the common approach is to search only for approximate solutions.
● A particularly well-known approximate method is Lloyd's algorithm, often referred to simply as the "k-means algorithm".

K-means Clustering

What? k-means is one of the simplest algorithms that uses an unsupervised learning method for data mining. In the k-means clustering algorithm, n objects are clustered into k clusters or partitions on the basis of attributes or features, where k < n and k is a positive integer.

Why? It works really well with large datasets of unlabelled data.

How? The grouping of objects is done by minimizing the sum of squares of distances, i.e., the Euclidean distance between each data point and its corresponding cluster centroid. After grouping is done once, it is further refined to achieve a model that fits the data with the utmost efficiency. In simple words, the objects are grouped into 'k' clusters on the basis of attributes or features.
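The objective being minimized (the within-cluster sum of squared distances) can be computed directly. As a minimal sketch, the function name and the sample points, labels, and centroids below are made up for illustration:

```python
import math

def within_cluster_ss(points, labels, centroids):
    """Sum of squared Euclidean distances from each point to its
    assigned cluster centroid -- the objective k-means minimizes."""
    return sum(
        math.dist(p, centroids[label]) ** 2
        for p, label in zip(points, labels)
    )

# Illustrative data: two points assigned to centroid 0, one to centroid 1.
points = [(0, 0), (0, 2), (10, 0)]
centroids = [(0, 1), (10, 0)]
labels = [0, 0, 1]
sse = within_cluster_ss(points, labels, centroids)  # 1 + 1 + 0 = 2.0
```

A lower value of this sum means the centroids represent their clusters more tightly, which is why each refinement step of k-means decreases (or at worst preserves) it.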

The Algorithm
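The algorithm alternates an assignment step and an update step until the centroids stop moving. A minimal sketch of Lloyd's algorithm in pure Python (the function name and sample points are illustrative, not from the original):

```python
import math
import random

def kmeans(points, k, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate assignment and update steps."""
    rng = random.Random(seed)
    # Start with k distinct data points as the initial centroids.
    centroids = rng.sample(points, k)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid
        # (Euclidean distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for j, members in enumerate(clusters):
            if members:
                new_centroids.append(tuple(sum(xs) / len(members)
                                           for xs in zip(*members)))
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # centroids stopped moving: converged
        centroids = new_centroids
    return clusters, centroids

# Two well-separated groups of points should end up in separate clusters.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
clusters, centroids = kmeans(points, 2)
```

Note that the result depends on the random initialization, which is one reason k-means is usually run several times with different seeds.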

Understanding through an example

The clusters obtained are: {1,2} and {3,4,5,6,7}

● Instances in Cluster 1, i.e., C1 are 1, 5, 6 and 7
● Instances in Cluster 2, i.e., C2 are 2 and 8
● Instances in Cluster 3, i.e., C3 are 3, 4, 9 and 10
The first iteration results in four, two and four
students in the first, second and third cluster,
respectively.

Hierarchical Methods

● Hierarchical methods create a hierarchical decomposition of the set of data (or objects) using some criterion.
● Hierarchical clustering is a type of cluster analysis that seeks to generate a hierarchy of clusters. It is also called Hierarchical Cluster Analysis (HCA).
● Hierarchical clustering methods build a nested series of clusters, in contrast to partitioning methods, which generate only a flat set of clusters.
● There are two types of hierarchical clustering: agglomerative and divisive.
● Agglomerative is a 'bottom-up' approach: at the start, each object is a cluster by itself, and nearby clusters are repeatedly combined, resulting in larger and larger clusters until some stopping criterion is met.
● The stopping criterion may be a specified number of clusters, or the stage at which all the objects are combined into a single large cluster, the highest level of the hierarchy.

In agglomerative clustering, suppose a set of N items is given to be clustered. The procedure is as follows:

1. Allocate each item to its own cluster, so that N items give N clusters, each consisting of just one item. Initially, the distances (similarities) between the clusters are the same as the distances (similarities) between the items they contain.
2. Identify the nearest (most similar) pair of clusters and merge that pair into a single cluster, so that there is now one cluster less.
3. Recalculate the distances (similarities) between each of the old clusters and the new cluster.
4. Repeat steps 2 and 3 until all items are merged into a single cluster of size N.

Because this type of hierarchical clustering merges clusters recursively, it is known as agglomerative (the action or process of collecting in a mass).

Agglomerative: The Algorithm
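The merging procedure above can be sketched in pure Python. This is a minimal, naive implementation (roughly cubic in the number of items); the function name and sample points are illustrative assumptions, and the choice of cluster-to-cluster distance is taken as a parameter:

```python
import math

def agglomerative(points, n_clusters, linkage="single"):
    """Bottom-up clustering: start with one cluster per item and
    repeatedly merge the closest pair until n_clusters remain."""
    clusters = [[p] for p in points]  # step 1: each item is its own cluster

    def cluster_dist(a, b):
        # Pairwise member distances decide how close two clusters are.
        d = [math.dist(p, q) for p in a for q in b]
        if linkage == "single":
            return min(d)           # closest pair of members
        if linkage == "complete":
            return max(d)           # farthest pair of members
        return sum(d) / len(d)      # average over all pairs

    while len(clusters) > n_clusters:
        # step 2: find the nearest (most similar) pair of clusters
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: cluster_dist(clusters[ij[0]], clusters[ij[1]]),
        )
        # step 3: merge the pair into a single cluster
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Two compact groups of points merge into two clusters.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (5, 7)]
result = agglomerative(points, 2, linkage="single")
```

Stopping at `n_clusters == 1` would carry the merging all the way to the single top-level cluster described in step 4.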

Before any clustering is performed, the distance between clusters must be computed using some metric. These metrics are called linkage metrics.

Linkage metrics

● Single Linkage
● Average Linkage
● Complete Linkage
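The three metrics differ only in how they aggregate the pairwise member distances between two clusters. As an illustrative sketch (the two small clusters below are made-up data):

```python
import math

# Two made-up clusters of 2-D points.
a = [(0, 0), (0, 2)]
b = [(3, 0), (5, 0)]

# All pairwise distances between members of a and members of b.
d = [math.dist(p, q) for p in a for q in b]

single = min(d)            # single linkage: closest cross-cluster pair (3.0 here)
complete = max(d)          # complete linkage: farthest cross-cluster pair
average = sum(d) / len(d)  # average linkage: mean over all pairs
```

Single linkage tends to produce elongated, chain-like clusters, while complete linkage favours compact ones; average linkage sits between the two.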

Understanding through an example