I need to do some cluster analysis on a set of 2-dimensional data (I may add extra dimensions along the way). The analysis itself will form part of the data being fed into a visualisation, rather than the input to another process (e.g. Radial Basis Function Networks). To this end, I'd like to find a set of clusters which primarily "looks right", rather than elucidating some hidden patterns.

My intuition is that k-means would be a good starting place for this, but that finding the right number of clusters to run the algorithm with would be problematic. The problem I'm coming to is this: how do I determine the 'best' value for k such that the clusters formed are stable and visually verifiable?

Questions:

- Assuming that this isn't NP-complete, what is the time complexity of finding a good k (probably reported as the number of times the k-means algorithm has to be run)?
- Is k-means a good starting point for this type of problem? If so, what other approaches would you recommend? A specific example, backed by an anecdote/experience, would be maxi-bon.
- What shortcuts/approximations would you recommend to increase the performance?


Determining the best k for a k nearest neighbour

8 Answers

For problems with an unknown number of clusters, agglomerative hierarchical clustering is often a better route than k-means.

Agglomerative clustering produces a tree structure, where the closer you are to the trunk, the fewer the number of clusters, so it's easy to scan through all numbers of clusters. The algorithm starts by assigning each point to its own cluster, and then repeatedly groups the two closest centroids. Keeping track of the grouping sequence allows an instant snapshot for any number of possible clusters. Therefore, it's often preferable to use this technique over k-means when you don't know how many groups you'll want.

There are other hierarchical clustering methods (see the paper suggested in Imran's comments). The primary advantage of an agglomerative approach is that there are many implementations out there, ready-made for your use.
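To make the "snapshot for any number of clusters" idea concrete, here is a minimal pure-Python sketch of centroid-based agglomeration. The function names (`agglomerate`, `centroid`, `sq_dist`) and the sample points are my own illustration, not taken from any library, and the naive pair search is far too slow for large data sets; a ready-made implementation would be the sensible choice in practice.

```python
from itertools import combinations

def centroid(points):
    """Mean of a list of (x, y) points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def sq_dist(a, b):
    """Squared Euclidean distance between two (x, y) points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def agglomerate(points):
    """Merge the two clusters with the closest centroids until one remains.

    Returns a dict mapping each cluster count k to the clustering
    (a list of lists of points) that existed when k clusters were left.
    """
    clusters = [[p] for p in points]              # every point starts alone
    snapshots = {len(clusters): [list(c) for c in clusters]}
    while len(clusters) > 1:
        # Find the pair of clusters whose centroids are closest.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: sq_dist(centroid(clusters[ij[0]]),
                                   centroid(clusters[ij[1]])),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
        snapshots[len(clusters)] = [list(c) for c in clusters]
    return snapshots

# Hypothetical sample data: three visually obvious groups.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
          (5.0, 5.0), (5.1, 4.9),
          (9.0, 0.0), (9.2, 0.1)]
snapshots = agglomerate(points)
for k in (2, 3, 4):
    print(k, [sorted(c) for c in snapshots[k]])
```

One pass builds the whole merge sequence, so comparing k=2, k=3 and k=4 afterwards is just a dictionary lookup rather than a fresh clustering run.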

answered Sep 26 '22 01:09

tom10

In order to use k-means, you should know how many clusters there are. You can't try a naive meta-optimisation, since the more clusters you add (up to one cluster per data point), the closer you get to over-fitting. You may look for some cluster validation methods and optimise the k hyperparameter with them, but in my experience it rarely works well. It's very costly too.

If I were you, I would do a PCA, possibly in polynomial space (keep an eye on your available time), depending on what you know of your input, and cluster along the most representative components.

More information on your data set would be very helpful for a more precise answer.

answered Sep 23 '22 01:09

Aszarsha

Here's my approximate solution:

1. Start with k=2.
2. For a number of tries:
   a. Run the k-means algorithm to find k clusters.
   b. Find the mean square distance from the origin to the cluster centroids.
3. Take the standard deviation of these distances across the tries. This is a proxy for the stability of the clusters.
4. If the stability of clusters for k < stability of clusters for k - 1, then return k - 1.
5. Increment k by 1 and go back to step 2.

The thesis behind this algorithm is that, for "good" values of k, repeated runs converge to only a small number of distinct sets of k clusters.

If we can find a local optimum for this stability, or an optimal delta for the stability, then we can find a good set of clusters which cannot be improved by adding more clusters.
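The procedure above could be sketched roughly as follows. This is only an illustration under assumptions: the helper names (`kmeans`, `instability`, `choose_k`), the number of tries, the cap `max_k` and the generated sample data are all made up for the example, and "stability" is read as a low standard deviation across runs.

```python
import random
import statistics

def kmeans(points, k, rng, iterations=50):
    """Plain Lloyd's algorithm; returns the k centroids as (x, y) tuples."""
    centroids = rng.sample(points, k)              # random starting centroids
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for x, y in points:
            nearest = min(range(k), key=lambda c: (x - centroids[c][0]) ** 2
                                                  + (y - centroids[c][1]) ** 2)
            groups[nearest].append((x, y))
        updated = [
            (sum(p[0] for p in g) / len(g), sum(p[1] for p in g) / len(g))
            if g else centroids[c]                 # an empty cluster keeps its centroid
            for c, g in enumerate(groups)
        ]
        if updated == centroids:                   # converged
            break
        centroids = updated
    return centroids

def instability(points, k, rng, tries=20):
    """Standard deviation, over several runs, of the mean squared distance
    from the origin to the centroids. Lower means more repeatable."""
    scores = []
    for _ in range(tries):
        centroids = kmeans(points, k, rng)
        scores.append(sum(x * x + y * y for x, y in centroids) / k)
    return statistics.pstdev(scores)

def choose_k(points, max_k=8, seed=0):
    """Increase k until the runs become less repeatable than for k - 1."""
    rng = random.Random(seed)
    previous = instability(points, 2, rng)
    for k in range(3, max_k + 1):
        current = instability(points, k, rng)
        if current > previous:                     # k is less stable than k - 1
            return k - 1
        previous = current
    return max_k                                   # assumed cap, not part of the idea

# Hypothetical sample data: three Gaussian blobs.
data_rng = random.Random(42)
blobs = [(0, 0), (6, 6), (12, 0)]
points = [(data_rng.gauss(cx, 0.5), data_rng.gauss(cy, 0.5))
          for cx, cy in blobs for _ in range(30)]
best_k = choose_k(points)
print("chosen k:", best_k)
```

Because k-means starts from random centroids, the chosen k depends on the seed and on the number of tries, so it is worth rerunning with different seeds before trusting the result.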

answered Sep 26 '22 01:09

jamesh

In a previous answer, I explained how Self-Organizing Maps (SOM) can be used in visual clustering.

Otherwise, there exists a variation of the K-Means algorithm called X-Means, which is able to find the number of clusters by optimizing the Bayesian Information Criterion (BIC), in addition to solving the problem of scalability by using KD-trees.
Weka includes an implementation of X-Means along with many other clustering algorithms, all in an easy-to-use GUI tool.

Finally, you might refer to this page, which discusses the Elbow Method among other techniques for determining the number of clusters in a dataset.
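For the Elbow Method mentioned above, the usual recipe is to plot the within-cluster sum of squares against k and look for the point where the curve stops dropping sharply. A minimal standard-library sketch, assuming a basic Lloyd's k-means with random restarts (the names `kmeans_sse` and `elbow_curve` and the sample data are hypothetical):

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two (x, y) points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def kmeans_sse(points, k, rng, iterations=50):
    """One run of Lloyd's algorithm; returns the within-cluster sum of squares."""
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[min(range(k), key=lambda c: sq_dist(p, centroids[c]))].append(p)
        updated = [
            (sum(p[0] for p in g) / len(g), sum(p[1] for p in g) / len(g))
            if g else centroids[c]                 # an empty cluster keeps its centroid
            for c, g in enumerate(groups)
        ]
        if updated == centroids:
            break
        centroids = updated
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)

def elbow_curve(points, max_k, restarts=10, seed=0):
    """Best (lowest) sum of squares found for each k from 1 to max_k."""
    rng = random.Random(seed)
    return {k: min(kmeans_sse(points, k, rng) for _ in range(restarts))
            for k in range(1, max_k + 1)}

# Hypothetical sample data: three Gaussian blobs.
data_rng = random.Random(1)
blobs = [(0, 0), (6, 6), (12, 0)]
points = [(data_rng.gauss(cx, 0.5), data_rng.gauss(cy, 0.5))
          for cx, cy in blobs for _ in range(30)]

curve = elbow_curve(points, max_k=6)
for k, sse in curve.items():
    print(f"k={k}  within-cluster sum of squares={sse:.1f}")
```

The curve always tends to decrease as k grows, so the method relies on judging where the improvement flattens out, which fits a "looks right" goal but is hard to automate reliably.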

answered Sep 24 '22 01:09

Amro

You might look at papers on cluster validation. Here's one that is cited in papers that involve microarray analysis, which involves clustering genes with related expression levels.

One such technique is the Silhouette measure, which evaluates how close a labeled point is to its centroid. The general idea is that, if a point is assigned to one centroid but is still close to others, perhaps it was assigned to the wrong centroid. By counting these events across training sets and looking across various k-means clusterings, one looks for the k such that the labeled points overall fall into the "best" or minimally ambiguous arrangement.

It should be said that clustering is more of a data visualization and exploration technique. It can be difficult to elucidate with certainty that one clustering explains the data correctly, above all others. It's best to merge your clusterings with other relevant information. Is there something functional or otherwise informative about your data, such that you know some clusterings are impossible? This can reduce your solution space considerably.
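As an illustration of the Silhouette measure, here is a small sketch using the standard definition, which compares a point's mean distance to its own cluster (a) with its mean distance to the nearest other cluster (b) and scores it as (b - a) / max(a, b). The function name `mean_silhouette` and the four sample points are made up for the example.

```python
import math
from collections import defaultdict

def mean_silhouette(points, labels):
    """Average silhouette score of a labelling; needs at least two clusters."""
    clusters = defaultdict(list)
    for p, label in zip(points, labels):
        clusters[label].append(p)
    total = 0.0
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            continue                       # singleton clusters score 0 by convention
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(sum(math.dist(p, q) for q in members) / len(members)
                for other, members in clusters.items() if other != label)
        if max(a, b) > 0:                  # guard against coincident points
            total += (b - a) / max(a, b)
    return total / len(points)

# Hypothetical data: two tight pairs, far apart.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
sensible = mean_silhouette(points, [0, 0, 1, 1])   # pairs kept together
awkward = mean_silhouette(points, [0, 1, 0, 1])    # pairs split apart
print(f"sensible labelling: {sensible:.3f}")
print(f"awkward labelling:  {awkward:.3f}")
```

Scores near 1 mean points sit well inside their clusters and negative scores suggest misassignment, so one can compute this for each candidate k and prefer the k with the highest average.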

answered Sep 24 '22 01:09

Alex Reynolds

From your wikipedia link:

Regarding computational complexity, the k-means clustering problem is:

NP-hard in general Euclidean space d even for 2 clusters

NP-hard for a general number of clusters k even in the plane

If k and d are fixed, the problem can be exactly solved in time O(n^(dk+1) log n), where n is the number of entities to be clustered

Thus, a variety of heuristic algorithms are generally used.

That said, finding a good value of k is usually a heuristic process (i.e. you try a few and select the best).

I think k-means is a good starting point: it is simple and easy to implement (or copy). Only look further if you have serious performance problems.

If the set of points you want to cluster is exceptionally large, a first-order optimisation would be to randomly select a small subset and use that subset to find your k-means.
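A rough sketch of that subsampling shortcut, assuming a basic Lloyd's k-means: fit the centroids on a random subset only, then label every point by its nearest centroid. The names (`nearest`, `kmeans`, `cluster_via_subsample`), the sample size and the generated data are illustrative choices, not recommendations.

```python
import random

def nearest(p, centroids):
    """Index of the centroid closest to point p."""
    return min(range(len(centroids)),
               key=lambda c: (p[0] - centroids[c][0]) ** 2
                             + (p[1] - centroids[c][1]) ** 2)

def kmeans(points, k, rng, iterations=50):
    """Plain Lloyd's algorithm; returns the k centroids as (x, y) tuples."""
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        updated = [
            (sum(p[0] for p in g) / len(g), sum(p[1] for p in g) / len(g))
            if g else centroids[c]                 # an empty cluster keeps its centroid
            for c, g in enumerate(groups)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids

def cluster_via_subsample(points, k, sample_size, seed=0):
    """Fit centroids on a random subset, then label the full data set."""
    rng = random.Random(seed)
    sample = rng.sample(points, min(sample_size, len(points)))
    centroids = kmeans(sample, k, rng)             # the expensive part sees few points
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels

# Hypothetical sample data: 600 points in three Gaussian blobs.
data_rng = random.Random(7)
blobs = [(0, 0), (6, 6), (12, 0)]
points = [(data_rng.gauss(cx, 0.5), data_rng.gauss(cy, 0.5))
          for cx, cy in blobs for _ in range(200)]

centroids, labels = cluster_via_subsample(points, k=3, sample_size=60)
print(centroids)
```

The iterative part then costs time proportional to the sample size rather than to n, at the price of centroids that are only approximately those of the full data.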

answered Sep 26 '22 01:09

jilles de wit

Choosing the best K can be seen as a model selection problem. One possible approach is Minimum Description Length, which in this context means the following: at one extreme, you could store a table with all the points (in which case K=N). At the other extreme, you have K=1, and all the points are stored as their distances from a single centroid. The preferred K lies between these extremes, where the combined cost of describing the centroids and the points' offsets from them is smallest. This section from Introduction to Information Retrieval by Manning, Raghavan and Schütze suggests minimising the Akaike Information Criterion as a heuristic for an optimal K.
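In that book's K-means formulation, the criterion amounts to picking the K that minimises RSS(K) + 2·M·K, where RSS(K) is the best residual sum of squares found for K clusters and M is the number of dimensions. A small sketch of the selection step follows; the function name `aic_best_k` is hypothetical and the RSS values are invented purely for illustration.

```python
def aic_best_k(rss_by_k, dims):
    """Return the K minimising RSS(K) + 2 * dims * K, plus all the scores."""
    scores = {k: rss + 2 * dims * k for k, rss in rss_by_k.items()}
    return min(scores, key=scores.get), scores

# Invented RSS values for K = 1..5 on 2-dimensional data (not real results).
rss_by_k = {1: 950.0, 2: 420.0, 3: 61.0, 4: 58.5, 5: 56.0}
best_k, scores = aic_best_k(rss_by_k, dims=2)
print(best_k, scores)
```

Note that RSS depends on the units of the data while the penalty term does not, so the outcome changes if the data is rescaled; normalising the inputs first is a reasonable precaution.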

answered Sep 23 '22 01:09

Yuval F

This problem belongs to the "internal evaluation" class of "clustering optimisation problems", whose current state-of-the-art solution seems to be the Silhouette coefficient, as stated here:

https://en.wikipedia.org/wiki/Cluster_analysis#Applications

and here:

https://en.wikipedia.org/wiki/Silhouette_(clustering) :

"silhouette plots and averages may be used to determine the natural number of clusters within a dataset"

scikit-learn provides a sample implementation of this methodology here: http://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html

answered Sep 26 '22 01:09

user1767316

Tags: language-agnostic, algorithm, complexity-theory, artificial-intelligence, cluster-analysis

jamesh
