Partitioning and Hierarchical Methods for Clustering, Summaries of Data Warehousing

This document summarizes the k-means clustering algorithm, an unsupervised machine learning algorithm that groups similar objects together to form clusters. The algorithm works by minimizing the sum of squared distances between data points and their corresponding cluster centroid. The document also highlights some issues with k-means, such as its sensitivity to outliers and its restriction to continuous data.

Typology: Summaries

2021/2022

Available from 09/24/2023

jitendra-kumar-28

Partitioning and Hierarchical Methods for Clustering

● Cluster Analysis is the process of grouping a set of similar objects to form clusters.
● A group of data points comes together to form a cluster in which all the objects belong to the same group.
● This is done so that records within a cluster (intra-cluster) have high similarity with one another but high dissimilarity to records in other clusters (inter-cluster).
● It is an unsupervised machine learning-based technique that acts on unlabelled data.

Clustering

Partitioning Methods

● Partitioning methods construct random partitions and then iteratively refine them by some criterion.
● Partitioning methods are also called centroid-based clustering: each cluster is represented by a central vector, which is not necessarily a member of the data set.
● The optimization problem (iteratively refining the centroid selection) is known to be NP-hard, so the common approach is to search only for approximate solutions.
● A particularly well-known approximate method is Lloyd's algorithm, often referred to simply as the "k-means algorithm".

K-means Clustering

What? k-means is one of the simplest algorithms that uses an unsupervised learning method for data mining. In the k-means clustering algorithm, n objects are clustered into k clusters or partitions on the basis of attributes or features, where k < n and k is a positive integer.

Why? It works really well with large datasets of unlabelled data.

How? The grouping of objects is done by minimizing the sum of squares of distances, i.e., the Euclidean distance between each data point and its corresponding cluster centroid. After grouping is done once, it is further refined to achieve a model that fits the data with the utmost efficiency. In simple words, the objects are grouped into 'k' clusters on the basis of attributes or features.
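The objective being minimized (the within-cluster sum of squared distances) can be computed directly. As a minimal sketch, the function name and the sample points, labels, and centroids below are made up for illustration:

```python
import math

def within_cluster_ss(points, labels, centroids):
    """Sum of squared Euclidean distances from each point to its
    assigned cluster centroid -- the objective k-means minimizes."""
    return sum(
        math.dist(p, centroids[label]) ** 2
        for p, label in zip(points, labels)
    )

# Illustrative data: two points assigned to centroid 0, one to centroid 1.
points = [(0, 0), (0, 2), (10, 0)]
centroids = [(0, 1), (10, 0)]
labels = [0, 0, 1]
sse = within_cluster_ss(points, labels, centroids)  # 1 + 1 + 0 = 2.0
```

A lower value of this sum means the centroids represent their clusters more tightly, which is why each refinement step of k-means decreases (or at worst preserves) it.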

The Algorithm
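The algorithm alternates an assignment step and an update step until the centroids stop moving. A minimal sketch of Lloyd's algorithm in pure Python (the function name and sample points are illustrative, not from the original):

```python
import math
import random

def kmeans(points, k, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate assignment and update steps."""
    rng = random.Random(seed)
    # Start with k distinct data points as the initial centroids.
    centroids = rng.sample(points, k)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid
        # (Euclidean distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for j, members in enumerate(clusters):
            if members:
                new_centroids.append(tuple(sum(xs) / len(members)
                                           for xs in zip(*members)))
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # centroids stopped moving: converged
        centroids = new_centroids
    return clusters, centroids

# Two well-separated groups of points should end up in separate clusters.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
clusters, centroids = kmeans(points, 2)
```

Note that the result depends on the random initialization, which is one reason k-means is usually run several times with different seeds.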

Understanding through an example

The clusters obtained are: {1,2} and {3,4,5,6,7}

● Instances in Cluster 1, i.e., C1 are 1, 5, 6 and 7
● Instances in Cluster 2, i.e., C2 are 2 and 8
● Instances in Cluster 3, i.e., C3 are 3, 4, 9 and 10
The first iteration results in four, two and four
students in the first, second and third cluster,
respectively.

Hierarchical Methods

● Hierarchical methods create a hierarchical decomposition of the set of data (or objects) using some criterion.
● Hierarchical clustering is a type of cluster analysis that seeks to generate a hierarchy of clusters. It is also called Hierarchical Cluster Analysis (HCA).
● Hierarchical clustering methods build a nested series of clusters, in contrast to partitioning methods, which generate only a flat set of clusters.
● There are two types of hierarchical clustering: agglomerative and divisive.
● Agglomerative is a 'bottom-up' approach: at the start, each object is a cluster by itself, and nearby clusters are repeatedly combined, resulting in larger and larger clusters until some stopping criterion is met.
● The stopping criterion may be a specified number of clusters, or the stage at which all the objects are combined into a single large cluster, the highest level of the hierarchy.

In agglomerative clustering, suppose a set of N items is given to be clustered. The procedure is as follows:

1. Allocate each item to its own cluster, so that N items give N clusters, each consisting of just one item. Initially, the distances (similarities) between the clusters are the same as the distances (similarities) between the items they contain.
2. Identify the nearest (most similar) pair of clusters and merge that pair into a single cluster, so that there is now one cluster less.
3. Recalculate the distances (similarities) between each of the old clusters and the new cluster.
4. Repeat steps 2 and 3 until all items are merged into a single cluster of size N.

Because this type of hierarchical clustering merges clusters recursively, it is known as agglomerative (the action or process of collecting in a mass).

Agglomerative: The Algorithm
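The merging procedure above can be sketched in pure Python. This is a minimal, naive implementation (roughly cubic in the number of items); the function name and sample points are illustrative assumptions, and the choice of cluster-to-cluster distance is taken as a parameter:

```python
import math

def agglomerative(points, n_clusters, linkage="single"):
    """Bottom-up clustering: start with one cluster per item and
    repeatedly merge the closest pair until n_clusters remain."""
    clusters = [[p] for p in points]  # step 1: each item is its own cluster

    def cluster_dist(a, b):
        # Pairwise member distances decide how close two clusters are.
        d = [math.dist(p, q) for p in a for q in b]
        if linkage == "single":
            return min(d)           # closest pair of members
        if linkage == "complete":
            return max(d)           # farthest pair of members
        return sum(d) / len(d)      # average over all pairs

    while len(clusters) > n_clusters:
        # step 2: find the nearest (most similar) pair of clusters
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: cluster_dist(clusters[ij[0]], clusters[ij[1]]),
        )
        # step 3: merge the pair into a single cluster
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Two compact groups of points merge into two clusters.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (5, 7)]
result = agglomerative(points, 2, linkage="single")
```

Stopping at `n_clusters == 1` would carry the merging all the way to the single top-level cluster described in step 4.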

Before any clustering is performed, the distance between clusters must be computed using some metric. These metrics are called linkage metrics.

Linkage metrics

● Single Linkage
● Average Linkage
● Complete Linkage
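The three metrics differ only in how they aggregate the pairwise member distances between two clusters. As an illustrative sketch (the two small clusters below are made-up data):

```python
import math

# Two made-up clusters of 2-D points.
a = [(0, 0), (0, 2)]
b = [(3, 0), (5, 0)]

# All pairwise distances between members of a and members of b.
d = [math.dist(p, q) for p in a for q in b]

single = min(d)            # single linkage: closest cross-cluster pair (3.0 here)
complete = max(d)          # complete linkage: farthest cross-cluster pair
average = sum(d) / len(d)  # average linkage: mean over all pairs
```

Single linkage tends to produce elongated, chain-like clusters, while complete linkage favours compact ones; average linkage sits between the two.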

Understanding through an example