I've been studying k-means clustering, and one thing that's not clear to me is how you choose the value of k. Is it just a matter of trial and error, or is there more to it?


How do I determine k when using k-means clustering?

2 Answers

You can maximize the Bayesian Information Criterion (BIC):

BIC(C | X) = L(X | C) - (p / 2) * log n

where L(X | C) is the log-likelihood of the dataset X according to model C, p is the number of parameters in the model C, and n is the number of points in the dataset. See "X-means: extending K-means with efficient estimation of the number of clusters" by Dan Pelleg and Andrew Moore in ICML 2000.
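To make the formula concrete, here is a rough sketch in plain Python. It is not the X-means implementation: the likelihood model is my own assumption (hard assignments, spherical Gaussians with one shared variance estimated by maximum likelihood), and the names `bic`, `data`, `one_cluster` and `three_clusters` are made up for the example. The paper's exact estimator may differ.

```python
import math

def bic(points, labels):
    """BIC(C | X) = L(X | C) - (p / 2) * log n for a hard clustering.

    Assumed model (not taken from the paper): each cluster is a spherical
    Gaussian, and all clusters share one variance.
    """
    n, d = len(points), len(points[0])
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    k = len(clusters)

    # Total squared distance of every point to its own cluster mean.
    ss = 0.0
    for members in clusters.values():
        centroid = [sum(coords) / len(members) for coords in zip(*members)]
        ss += sum((x - m) ** 2 for p in members for x, m in zip(p, centroid))
    if ss == 0.0:
        raise ValueError("zero variance: the likelihood is unbounded")
    var = ss / (n * d)  # maximum-likelihood estimate of the shared variance

    # Log-likelihood: mixing-weight term plus the Gaussian density term.
    loglik = sum(len(m) * math.log(len(m) / n) for m in clusters.values())
    loglik -= 0.5 * n * d * (math.log(2 * math.pi * var) + 1)

    # Free parameters: k - 1 mixing weights, k * d centroid coordinates,
    # and 1 shared variance.
    p_free = (k - 1) + k * d + 1
    return loglik - 0.5 * p_free * math.log(n)

# Toy data: three visibly separate groups of three points each.
data = [(0.0, 0.0), (0.5, 0.2), (0.1, 0.6),
        (5.0, 5.0), (5.4, 4.8), (4.9, 5.5),
        (10.0, 0.0), (10.3, 0.4), (9.7, 0.2)]
one_cluster = [0] * 9
three_clusters = [0, 0, 0, 1, 1, 1, 2, 2, 2]

print("BIC, k=1:", bic(data, one_cluster))
print("BIC, k=3:", bic(data, three_clusters))
```

In practice you would get the labels from k-means runs at different k and keep the k with the largest BIC. Note that the k = n case has zero variance, so the sketch refuses it rather than returning an infinite score.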

Another approach is to start with a large value for k and keep removing centroids (reducing k) for as long as doing so reduces the description length. See "MDL principle for robust vector quantisation" by Horst Bischof, Ales Leonardis, and Alexander Selb in Pattern Analysis and Applications vol. 2, pp. 59-72, 1999.

Finally, you can start with one cluster, then keep splitting clusters until the points assigned to each cluster have a Gaussian distribution. In "Learning the k in k-means" (NIPS 2003), Greg Hamerly and Charles Elkan show some evidence that this works better than BIC, and that BIC does not penalize the model's complexity strongly enough.
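The core of that last idea is a yes/no question per cluster: do these points look Gaussian? Below is a minimal sketch of just that check on 1-D values, using an Anderson-Darling statistic. This is not the full procedure from the paper (which also handles splitting and multi-dimensional data). The function names, the test data, and the conventional 5% critical value of 0.752 are my own choices for illustration; the paper's settings may differ.

```python
import math
from statistics import NormalDist, mean, stdev

def anderson_darling(values):
    """Anderson-Darling statistic for normality, mean and variance estimated."""
    n = len(values)
    mu, sigma = mean(values), stdev(values)
    z = sorted((v - mu) / sigma for v in values)
    std_normal = NormalDist()
    # Clamp the CDF away from 0 and 1 so the logarithms stay finite.
    cdf = [min(max(std_normal.cdf(x), 1e-12), 1 - 1e-12) for x in z]
    s = sum((2 * i + 1) * (math.log(cdf[i]) + math.log(1 - cdf[n - 1 - i]))
            for i in range(n))
    a2 = -n - s / n
    # Small-sample correction for the case of estimated mean and variance.
    return a2 * (1 + 0.75 / n + 2.25 / n ** 2)

def looks_gaussian(values, critical=0.752):
    """True if the test does not reject normality (assumed 5% level)."""
    return anderson_darling(values) <= critical

# Deterministic toy data: exact normal quantiles, and two shifted copies.
bell = [NormalDist().inv_cdf((i + 0.5) / 50) for i in range(50)]
two_bumps = [x - 5 for x in bell] + [x + 5 for x in bell]

print("bell looks Gaussian:     ", looks_gaussian(bell))
print("two_bumps looks Gaussian:", looks_gaussian(two_bumps))
```

A cluster that fails the check would be split in two (for example with 2-means) and each half tested again; a cluster that passes is left alone.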


answered Oct 20 '22 13:10

Vebjorn Ljosa

Basically, you want to find a balance between two quantities: the number of clusters (k) and the average variance of the clusters. You want both to be small, but they pull against each other: as the number of clusters increases, the average variance decreases (down to the trivial case of k=n and variance=0).

As always in data analysis, there is no one true approach that works better than all others in all cases. In the end, you have to use your own best judgement. For that, it helps to run the algorithm for several values of k and plot the number of clusters against the average variance. Then you can use the number of clusters at the knee of the curve, that is, the point past which adding another cluster no longer reduces the variance by much.
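Here is a minimal sketch of producing that curve in plain Python, with no libraries. The toy data, the function names, and the choice of five random restarts per k are assumptions for illustration only; with real data you would more likely use a library implementation and plot the result.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(cluster):
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))

def kmeans(points, k, iters=100, seed=0):
    """Plain Lloyd's algorithm; returns the final centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    for _ in range(iters):
        # Assignment step: group each point with its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its old centroid).
        updated = [mean(c) if c else centroids[j]
                   for j, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return centroids

def average_variance(points, centroids):
    """Mean squared distance from each point to its nearest centroid."""
    total = sum(min(sq_dist(p, c) for c in centroids) for p in points)
    return total / len(points)

# Toy data: three visibly separate groups of three points each.
data = [(0.0, 0.0), (0.5, 0.2), (0.1, 0.6),
        (5.0, 5.0), (5.4, 4.8), (4.9, 5.5),
        (10.0, 0.0), (10.3, 0.4), (9.7, 0.2)]

curve = {}
for k in range(1, len(data) + 1):
    # Keep the best of a few restarts, since k-means can get stuck.
    curve[k] = min(average_variance(data, kmeans(data, k, seed=s))
                   for s in range(5))
    print(k, round(curve[k], 3))
```

Reading the printed values from top to bottom, the knee is where the numbers stop dropping sharply and start to flatten out.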

answered Oct 20 '22 12:10

Jan Krüger


Tags: cluster-analysis, k-means

Jason Baker
