These notes describe the k-means clustering algorithm, an unsupervised machine learning algorithm that groups similar objects together to form clusters. The algorithm works by minimizing the sum of squared distances between data points and the corresponding cluster centroids. The notes also highlight some issues with the k-means algorithm, such as its sensitivity to outliers and its restriction to continuous data.
● Cluster analysis is the process of grouping a set of similar objects in order to form clusters.
● A group of data points together comprises a cluster, in which all the objects belong to the same group.
● This is done so that records within a cluster (intra-cluster) have high similarity with one another but are highly dissimilar to objects in other clusters (inter-cluster).
● It is an unsupervised machine learning-based technique that acts on unlabelled data.
It constructs random partitions and then iteratively refines them by some criterion.
● Partitioning methods are also called centroid-based clustering: each cluster is represented by a central vector, which is not necessarily a member of the data set.
● The optimization problem (iteratively refining the centroid selection) is known to be NP-hard, and thus the common approach is to search only for approximate solutions.
● A particularly well-known approximate method is Lloyd's algorithm, often just referred to as the "k-means algorithm".
K-means Clustering

What? k-means is one of the simplest algorithms that uses an unsupervised learning method, and it is widely used in data mining. In the k-means clustering algorithm, n objects are clustered into k clusters or partitions on the basis of attributes, where k < n and k is a positive integer. In simple words, the objects are grouped into 'k' clusters on the basis of their attributes or features.

Why? It works well with large datasets of unlabelled data.

How? The grouping of objects is done by minimizing the sum of squares of distances, i.e., the Euclidean distance between each data point and the corresponding cluster centroid. After grouping is done once, it is further refined iteratively to achieve a model that fits the data as well as possible.
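To make the assignment-and-update loop of Lloyd's algorithm concrete, here is a minimal sketch in NumPy. The function name run_kmeans, the toy data, and the parameter choices are illustrative assumptions, not from the original notes, and edge cases such as empty clusters are not handled; in practice a library implementation such as scikit-learn's KMeans would be preferred.

```python
# Minimal sketch of Lloyd's algorithm (k-means). Names and data are
# illustrative assumptions, not taken from the original notes.
import numpy as np

def run_kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking k distinct data points at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: label each point with its nearest centroid
        # (squared Euclidean distance).
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its points.
        # (Empty clusters are not handled in this sketch.)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving: converged
        centroids = new_centroids
    return centroids, labels

# Example: two visually obvious groups in 2-D.
X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5], [8.5, 7.5]])
centroids, labels = run_kmeans(X, k=2)
print(labels)     # e.g. [0 0 1 1 1]
print(centroids)
```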
Worked example (figure not preserved): the clusters obtained are {1, 2} and {3, 4, 5, 6, 7}.
It creates a hierarchical decomposition of the set of data (or objects) using some criterion.
● Hierarchical clustering is a type of cluster analysis that seeks to generate a hierarchy of clusters. It is also called Hierarchical Cluster Analysis (HCA).
● Hierarchical clustering methods build a nested series of clusters, in contrast to partitioning methods, which generate only a flat set of clusters.
● There are two types of hierarchical clustering: agglomerative and divisive.
● Agglomerative clustering is a 'bottom-up' approach. In this approach, each object is a cluster by itself at the start, and nearby clusters are repeatedly combined, resulting in larger and larger clusters, until some stopping criterion is met.
● The stopping criterion may be a specified number of clusters, or the stage at which all the objects are combined into a single large cluster, which is the highest level of the hierarchy.
In agglomerative clustering, given a set of N items to be clustered, the procedure is as follows (a minimal sketch of this procedure in Python appears below):
1. Allocate each item to its own cluster, so that N items give N clusters, each consisting of just one item. Initially, the distances (similarities) between the clusters are the same as the distances (similarities) between the items they contain.
2. Identify the nearest (most similar) pair of clusters and merge that pair into a single cluster, so that there is now one cluster less.
3. Compute the distances (similarities) between each of the old clusters and the new cluster.
4. Repeat steps 2 and 3 until all items are merged into a single cluster of size N.
Because this type of hierarchical clustering merges clusters recursively, it is known as agglomerative (the action or process of collecting in a mass).
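As referenced above, here is a short sketch of the agglomerative procedure using single linkage (the distance between two clusters is the distance between their closest members). The function name agglomerate and the toy data are illustrative assumptions; the sketch stops at a target number of clusters rather than merging all the way down to one.

```python
# Sketch of bottom-up (agglomerative) clustering with single linkage.
# Names and data are illustrative assumptions, not from the notes.
import numpy as np
from itertools import combinations

def agglomerate(X, target_clusters=1):
    # Step 1: every item starts as its own cluster.
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > target_clusters:
        # Step 2: find the nearest pair of clusters under single linkage
        # (minimum pairwise distance between members).
        best_pair, best_dist = None, np.inf
        for a, b in combinations(range(len(clusters)), 2):
            d = min(np.linalg.norm(X[i] - X[j])
                    for i in clusters[a] for j in clusters[b])
            if d < best_dist:
                best_pair, best_dist = (a, b), d
        # Step 3: merge that pair into a single cluster.
        a, b = best_pair
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
        # Step 4 is the loop itself: repeat until the stopping criterion.
    return clusters

X = np.array([[1.0], [2.0], [9.0], [10.0], [11.0]])
print(agglomerate(X, target_clusters=2))  # e.g. [[0, 1], [2, 3, 4]]
```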
Before any clustering is performed, it is necessary to define how the distance between clusters will be computed. The metrics used for this are called linkage metrics.
● Single linkage: the distance between two clusters is the minimum distance between any member of one cluster and any member of the other.
● Average linkage: the distance between two clusters is the average of the distances over all pairs of members, one from each cluster.
● Complete linkage: the distance between two clusters is the maximum distance between any member of one cluster and any member of the other.
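Assuming SciPy is available, the sketch below runs hierarchical clustering under each of the three linkage metrics on the same toy data (an illustrative assumption) and cuts each resulting hierarchy into two clusters, so the effect of the linkage choice can be compared directly.

```python
# Compare single, average, and complete linkage with SciPy's
# hierarchical clustering. The toy data is an illustrative assumption.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 1.0], [2.0, 1.5], [9.0, 9.0], [10.0, 9.5], [5.0, 5.0]])

for method in ("single", "average", "complete"):
    Z = linkage(X, method=method)                    # merge history (dendrogram data)
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut hierarchy into 2 clusters
    print(method, labels)
```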