 

Improving k-means clustering

My lecture notes on computer vision mention that the performance of the k-means clustering algorithm can be improved if we know the standard deviation of the clusters. How so?

My thinking is that we can use the standard deviations to come up with a better initial estimate through histogram-based segmentation first. What do you think? Thanks for any help!
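To make the idea concrete, here is a rough sketch of the kind of seeding I have in mind for 1-D data such as grayscale intensities (the helper name, bin count, and peak rule are all just illustrative, not an established method):

```python
import numpy as np

# Illustrative sketch: seed k-means with the k most populated histogram
# peaks instead of random points (1-D data, e.g. grayscale intensities).
def histogram_seeds(data, k, bins=64):
    counts, edges = np.histogram(data, bins=bins)
    centers = (edges[:-1] + edges[1:]) / 2          # bin midpoints
    # indices of local maxima of the histogram
    peaks = [i for i in range(1, bins - 1)
             if counts[i] >= counts[i - 1] and counts[i] > counts[i + 1]]
    peaks.sort(key=lambda i: counts[i], reverse=True)
    return centers[peaks[:k]]                       # k tallest peaks as seeds

data = np.concatenate([np.random.normal(0, 1, 500),
                       np.random.normal(8, 1, 500)])
print(histogram_seeds(data, k=2))                   # roughly [0, 8]
```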

asked Jan 10 '11 by Dhruv Gairola

People also ask

How can you improve the performance of a clustering method?

Graph-based clustering performance can easily be improved by applying ICA blind source separation during the graph Laplacian embedding step. Applying unsupervised feature learning to the input data using either RICA or SFT also improves clustering performance.

What are the main weaknesses of K-means clustering?

The most important limitations of Simple k-means are:

  • The user has to specify k (the number of clusters) in advance.
  • k-means can only handle numerical data.
  • k-means assumes that we deal with spherical clusters and that each cluster has roughly equal numbers of observations.

How do I make k-means faster?

A primary method of accelerating k-means is applying geometric knowledge to avoid computing point-center distances when possible. Elkan's algorithm [8] exploits the triangle inequality to avoid many distance computations, and is the fastest current algorithm for high-dimensional data.
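The core bound is simple: by the triangle inequality, if d(c1, c2) >= 2·d(x, c1), then d(x, c2) >= d(x, c1), so c2 can be skipped. Here is a minimal sketch of just that test (my own illustration, not Elkan's full algorithm, which also maintains upper and lower bounds across iterations):

```python
import numpy as np

# Illustration of the triangle-inequality pruning test only, not Elkan's
# full algorithm: if d(best, c_j) >= 2 * d(x, best), then
# d(x, c_j) >= d(x, best), so c_j can never be the closest center.
def closest_center(x, centers):
    cc = np.linalg.norm(centers[:, None] - centers[None, :], axis=2)
    best, best_d = 0, np.linalg.norm(x - centers[0])
    for j in range(1, len(centers)):
        if cc[best, j] >= 2 * best_d:
            continue                      # pruned without computing d(x, c_j)
        d = np.linalg.norm(x - centers[j])
        if d < best_d:
            best, best_d = j, d
    return best

centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
print(closest_center(np.array([1.0, 0.5]), centers))  # -> 0
```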

How do you optimize the objective function of the K-means clustering algorithm?

The k-means algorithm alternates two steps: for a fixed set of centroids (prototypes), optimize the assignment by mapping each sample to its closest centroid under Euclidean distance; then update each centroid to the average of all the samples assigned to it.
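A minimal NumPy sketch of that alternation (my own illustrative code, not tied to any particular library):

```python
import numpy as np

# Minimal k-means sketch: alternate assignment and centroid updates.
def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Step 1: assign each sample to its closest centroid (Euclidean).
        dists = np.linalg.norm(X[:, None] - centroids[None, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: move each centroid to the mean of its assigned samples.
        new = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):   # converged
            break
        centroids = new
    return centroids, labels
```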


1 Answer

Your lecturer might have the 2002 paper by Veenman et al. in mind. The basic idea is that you set the maximum variance you allow in each cluster. You start with as many clusters as data points, and then you "evolve" clusters by:

  • merging neighboring clusters if the resulting cluster's variance is below the threshold
  • isolating elements that are "far" if a cluster's variance is above the threshold
  • or moving some elements between neighboring clusters if it decreases the sum of squared errors

(This evolution acts as a global optimization procedure and avoids the bad consequences of the initial assignment of cluster means that you get in plain k-means.)

To sum up: if you know the variance, you know how varied the clusters should be, so it becomes easier to, for example, detect outliers (which usually should be put into separate clusters).
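To make that concrete, here is a toy sketch of just the merge step for 1-D data (my own illustration, not the paper's actual procedure, which also isolates far elements and moves points between clusters). Note how the far-away point ends up in its own cluster:

```python
import numpy as np

# Toy sketch of the merge step only: greedily merge the pair of clusters
# whose joint variance is smallest, as long as it stays below max_var
# (which plays the role of the known cluster variance).
def merge_step_clustering(points, max_var):
    clusters = [[float(p)] for p in points]   # start: one cluster per point
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                v = np.var(clusters[i] + clusters[j])  # variance after merging
                if v <= max_var and (best is None or v < best[0]):
                    best = (v, i, j)
        if best is None:            # no merge keeps the variance low enough
            break
        _, i, j = best
        clusters[i] += clusters.pop(j)         # pop(j) is safe since j > i
    return clusters

data = [0.1, 0.2, 0.3, 5.0, 5.1, 9.9]
print(merge_step_clustering(data, max_var=0.5))
# -> [[0.1, 0.2, 0.3], [5.0, 5.1], [9.9]]  (the outlier is isolated)
```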

answered Sep 26 '22 by ang mo