K-Means Algorithm

K-means is one of the most popular unsupervised learning algorithms; it is used to cluster a given dataset into K regions. Clustering is the process of grouping a set of objects/items based on their similarity or dissimilarity. Clustering was traditionally used mostly in business market analysis, but nowadays you can find it in almost every application area. K-means is a simple method for cluster analysis in data mining: it partitions m data items into K clusters by assigning each item to the cluster with the nearest mean, and that mean (centroid) serves as a prototype of the cluster.
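
Formally, K-means can be viewed as minimizing a cost (distortion) function. One standard way to write it (this notation is my addition, not part of the original post) is

J(c^{(1)},\dots,c^{(m)},\mu_1,\dots,\mu_K) = \frac{1}{m} \sum_{i=1}^{m} \left\| x^{(i)} - \mu_{c^{(i)}} \right\|^2

where x^{(i)} is the i-th data item, c^{(i)} is the index of the cluster it is currently assigned to, and \mu_k is the mean (centroid) of cluster k.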

“How does the K-means algorithm work?” There are mainly two steps. First, it randomly selects K objects, each of which initially represents a cluster mean or center; these objects are known as the cluster centroids. Each of the remaining objects is then assigned to the cluster to which it is most similar, based on the distance between the object and the cluster mean. Next, the algorithm computes the new mean of each cluster. This process iterates until the criterion (cost) function converges.
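
As a quick illustration (my own sketch, not part of the original post), the scikit-learn library wraps these two steps behind a single call; the labels and centers it returns correspond to the assignment and update steps described above:

import numpy as np
from sklearn.cluster import KMeans

# A toy dataset of 2-D points forming two rough groups.
X = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2],
              [8.0, 8.0], [8.5, 7.5], [7.8, 8.3]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)   # K = 2
kmeans.fit(X)

print(kmeans.labels_)             # cluster index assigned to each point
print(kmeans.cluster_centers_)    # final cluster means (centroids)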


Figure: 1

Figure: 2
Suppose we have the dataset shown in Figure 1. We now place two points at random; these are the blue and red marks shown in Figure 2, so the value of K is 2. Then we compute the distance of each point from the two marks and color each point according to which mark is closer. For example, if a point is nearer to the blue mark than to the red mark, it is marked as a blue point. After marking every point red or blue, we get a picture like Figure 3.
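
The coloring described above is just a nearest-mark assignment. A small NumPy sketch of this step (my own, with made-up example points):

import numpy as np

points = np.array([[1.0, 1.0], [2.0, 1.5], [7.0, 8.0], [8.0, 7.5]])
blue = np.array([1.5, 1.0])    # randomly placed "blue" mark
red = np.array([7.5, 8.0])     # randomly placed "red" mark

# Distance of every point to each mark.
dist_to_blue = np.linalg.norm(points - blue, axis=1)
dist_to_red = np.linalg.norm(points - red, axis=1)

# A point is colored blue if it is nearer to the blue mark, otherwise red.
colors = np.where(dist_to_blue < dist_to_red, "blue", "red")
print(colors)    # ['blue' 'blue' 'red' 'red']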


Figure: 3

Figure: 4
Next, we need to update the cluster centers. That is, the mean of each cluster is recalculated from the objects/points currently assigned to it. Using the new cluster centers, the points are then redistributed to the clusters based on which center is nearest (Figure 4).
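
Continuing the sketch above (again my own example, not the post's figures), the update step is just a per-cluster mean:

import numpy as np

points = np.array([[1.0, 1.0], [2.0, 1.5], [7.0, 8.0], [8.0, 7.5]])
colors = np.array(["blue", "blue", "red", "red"])   # assignments from the previous step

# Each cluster center moves to the mean of its assigned points.
new_blue = points[colors == "blue"].mean(axis=0)    # -> [1.5, 1.25]
new_red = points[colors == "red"].mean(axis=0)      # -> [7.5, 7.75]
print(new_blue, new_red)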

The process runs iteratively, improving the partitioning at each step. Eventually no redistribution of points between clusters occurs, and the process terminates. The resulting clusters are returned as the output; the final result is shown in Figure 5.

Figure: 5 [Final Result]
The whole process of K-means is shown in the following pseudo-code:

Figure: 6

Figure: Pseudocode
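
Since the pseudo-code itself lives in the figure, here is a rough from-scratch sketch of the same loop in Python/NumPy. It is my own rendering of the standard algorithm, not a copy of the figure:

import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    # Step 1: pick K distinct data points as the initial centroids.
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    labels = None

    for _ in range(n_iters):
        # Assignment step: index of the nearest centroid for every point.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                                    # no reassignment -> converged
        labels = new_labels
        # Update step: each centroid becomes the mean of its assigned points.
        for k in range(K):
            if np.any(labels == k):                  # guard against an empty cluster
                centroids[k] = X[labels == k].mean(axis=0)
    return labels, centroids

# Example: labels, centers = kmeans(np.random.rand(100, 2), K=2)
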
There are two practical problems:
1) How should we place the initial K points (centroids)?
2) What should the value of K be?

If we do not set the initial centroid locations properly, we may end up with only a locally optimal result. A good way to initialize K-means is to select K (distinct) examples from the training set and set the cluster centroids equal to these selected examples.
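
In code, that initialization is simply a sample of K distinct rows (a sketch under the same assumptions as before). In practice, the whole algorithm is commonly run several times from different random initializations and the clustering with the lowest cost is kept:

import numpy as np

X = np.random.rand(100, 2)    # stand-in for the training set: 100 two-dimensional points
K = 3

rng = np.random.default_rng(0)
idx = rng.choice(len(X), size=K, replace=False)    # K distinct example indices
initial_centroids = X[idx]                         # centroids start at actual data points
print(initial_centroids)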

The other problem is the value of K itself. In Figure 7 we get K = 3 clusters, but it is also possible to form K = 2 clusters from the same dataset, as shown in Figure 8.

Figure: 7

Figure: 8

Elbow Algorithm:
This method helps us select the value of K. First, we compute the cost function for different numbers of clusters (K) on the same dataset and plot it against K. In the resulting graph we often find a point where the slope of the curve changes sharply, as shown in Figure 10. That point is known as the elbow point, and it gives our expected value of K. However, we may also get a curve whose slope changes almost uniformly, as in Figure 11; for this reason, the Elbow method may not help us in all scenarios. A small code sketch of the procedure follows the figures below.

Figure: 10 [Elbow Process]
Figure: 11 [Problem in Elbow Process]
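
A rough sketch of the elbow procedure (my own, assuming scikit-learn and matplotlib are available; scikit-learn's inertia_ attribute is the within-cluster sum of squared distances, i.e. the cost being plotted):

import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(200, 2)          # stand-in for the real dataset

ks = range(1, 11)
costs = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    costs.append(model.inertia_)    # cost for this choice of K

plt.plot(list(ks), costs, marker="o")
plt.xlabel("K (number of clusters)")
plt.ylabel("Cost (within-cluster sum of squares)")
plt.show()                          # look for the 'elbow' in this curve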

Sometimes we can infer the value of K from the application behind the dataset. Suppose we want to determine T-shirt sizes based on users' heights and weights. Then the number of sizes (= K) might be 3 or 4 (for example: small, medium, large, extra large), but not 5 or 6. In such cases we can overcome the problem of choosing K by using domain knowledge.

Applications:
K-means has been successfully used in various fields, including market segmentation, computer vision, geostatistics, astronomy, and agriculture. It is often used as a pre-processing step for other algorithms, for example to find a starting configuration.

Limitations:
K-means does not work well with noisy data or outliers, because an item with an extreme value can distort the mean and hence the resulting clusters. Another method, known as K-medoids, is more robust to noise and outliers than K-means. Wikipedia gives a nice brief overview of K-medoids with examples.
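
A tiny numeric illustration (my own numbers) of that distortion: one extreme value drags the mean far from the bulk of the data, whereas a medoid-style representative, which must be an actual data point (approximated here by the median), barely moves.

import numpy as np

values = np.array([1.0, 2.0, 3.0, 100.0])   # one extreme outlier
print(values.mean())      # 26.5 -> the mean is pulled toward the outlier
print(np.median(values))  # 2.5  -> a central representative is barely affected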


