The objectives of this section are:
to explain the K-Means clustering algorithm
to introduce its limitations
to examine the various major parts
By the time you have completed this section you will be able to:
briefly explain the K-Means clustering algorithm
list the major steps in the algorithm
K-means is one of the oldest and most commonly used clustering algorithms. It is a prototype based clustering technique defining the prototype in terms of a centroid which is considered to be the mean of a group of points and is applicable to objects in a continuous n-dimensional space.
The K-means algorithm involves randomly selecting K initial centroids where K is a user defined number of desired clusters. Each point is then assigned to a closest centroid and the collection of points close to a centroid form a cluster. The centroid gets updated according to the points in the cluster and this process continues until the points stop changing their clusters. A high-level representation of the K-means algorithm is shown in Figure 1.
Initialization step: Initial centroids are can be selected randomly.
Assignment step: this step raises the question, how do you define “closest centriod”. In order determine which centroid is closest to a particular data point we have to use a proximity measure. The Euclidean distance is commonly used.
Re-estimation step: a simplistic view point is for the centroid to be calculated as the mean of the points that have been assigned to its grouping.