I am running k-means clustering on ~1 million items (each represented as a ~100-feature vector). I have run the clustering for various values of k, and now want to evaluate the different results with the silhouette score implemented in sklearn. Running it without sampling appears infeasible and takes a prohibitively long time, so I assume I need to use sampling, i.e.:
metrics.silhouette_score(feature_matrix, cluster_labels, metric='euclidean', sample_size=???)
I don't have a good sense of what an appropriate sampling approach is, however. Is there a rule of thumb for what size sample to use given the size of my matrix? Is it better to take the largest sample my analysis machine can handle, or to take the average of more smaller samples?
I ask in large part because my preliminary test (with sample_size=10000) has produced some really really unintuitive results.
I'm also open to alternative, more scalable evaluation metrics.
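For concreteness, here is a minimal sketch of the repeated-smaller-samples variant I have in mind (assuming feature_matrix and cluster_labels as above; the repeat count and sample_size are arbitrary placeholders, not recommendations):

import numpy as np
from sklearn import metrics

scores = []
for seed in range(10):  # placeholder number of repeats
    # random_state controls which subsample is drawn, so each repeat scores a different subset
    scores.append(metrics.silhouette_score(feature_matrix, cluster_labels,
                                           metric='euclidean', sample_size=10000,
                                           random_state=seed))
print(np.mean(scores), np.std(scores))  # the spread gives a feel for the sampling noise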
Edit, to visualize the issue: the plot shows, for varying sample sizes, the silhouette score as a function of the number of clusters.
What's not weird is that increasing the sample size seems to reduce noise. What is weird, given that I have 1 million very heterogeneous vectors, is that 2 or 3 comes out as the "best" number of clusters. In other words, what's unintuitive is that I find a more-or-less monotonic decrease in silhouette score as I increase the number of clusters.
Points to remember when interpreting the silhouette coefficient: its value lies in the range [-1, 1]. A score of 1 is the best case, meaning a data point is very compact within the cluster to which it belongs and far away from the other clusters. A score around 0 means the clusters overlap, and a score below 0 suggests points may have been assigned to the wrong cluster; -1 is the worst value.
The Silhouette Coefficient is calculated using the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) for each sample. The Silhouette Coefficient for a sample is (b - a) / max(a, b). To clarify, b is the mean distance between a sample and the points of the nearest cluster that the sample is not a part of.
Silhouette analysis can be used to study the separation distance between the resulting clusters. The silhouette plot displays a measure of how close each point in one cluster is to points in the neighboring clusters and thus provides a way to assess parameters like the number of clusters visually.
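To illustrate on a subsample (a sketch assuming feature_matrix and cluster_labels from the question are NumPy arrays; the subset size is a placeholder), sklearn.metrics.silhouette_samples returns the per-point coefficient (b - a) / max(a, b) that a silhouette plot is built from:

import numpy as np
from sklearn.metrics import silhouette_samples

rng = np.random.default_rng(0)
idx = rng.choice(len(feature_matrix), size=10000, replace=False)  # keep it tractable for ~1M points
sil = silhouette_samples(feature_matrix[idx], cluster_labels[idx], metric='euclidean')

# mean silhouette per cluster: low or negative averages flag clusters whose
# points sit closer to a neighbouring cluster than to their own
for c in np.unique(cluster_labels[idx]):
    print(c, sil[cluster_labels[idx] == c].mean())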
k-means converges to a local minimum, so the starting positions play a crucial role in which number of clusters looks optimal. It is often a good idea to reduce noise and dimensionality with PCA (or another dimension-reduction technique) before running k-means.
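A minimal sketch of that pipeline (the number of PCA components, the value of k, and n_init below are placeholders; n_init restarts k-means from several initial positions and keeps the lowest-inertia run):

from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# reduce the ~100 features to a smaller, less noisy space before clustering
reduced = PCA(n_components=20, random_state=0).fit_transform(feature_matrix)

# several restarts reduce the sensitivity to the starting positions
km = KMeans(n_clusters=10, n_init=20, random_state=0)
labels = km.fit_predict(reduced)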
Just to add for the sake of completeness: it might be a good idea to get the optimal number of clusters with "partitioning around medoids" (PAM, i.e. k-medoids), which is commonly used together with the silhouette method to pick k.
The reason for the weird observations could be different starting points for the different-sized samples.
Having said all the above, it is important to evaluate the clusterability of the dataset at hand. A tractable way to do this is the Worst Pair ratio, as discussed here: Clusterability.
Other metrics
Elbow method: Compute the % variance explained for each K, and choose the K where the plot starts to level off (a good description is here: https://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set). Obviously, if K equals the number of data points, you can explain 100% of the variance; the question is where the improvements in variance explained start to level off.
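A sketch of that computation using k-means inertia (the within-cluster sum of squares), which converts directly to % variance explained; the range of K is a placeholder:

import numpy as np
from sklearn.cluster import KMeans

# total sum of squares around the global mean
total_ss = np.sum((feature_matrix - feature_matrix.mean(axis=0)) ** 2)

explained = {}
for k in range(2, 21):  # placeholder range of K
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(feature_matrix)
    explained[k] = 1 - km.inertia_ / total_ss  # variance explained = 1 - WSS/TSS
print(explained)  # plot K vs. explained and look for where the curve levels off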
Information theory: If you can calculate a likelihood for a given K, then you can use the AIC, AICc, or BIC (or any other information-theoretic approach). E.g. for the AICc, it just balances the increase in likelihood as you increase K with the increase in the number of parameters you need. In practice all you do is choose the K that minimises the AICc.
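k-means itself does not hand you a likelihood, so one common workaround in sklearn (a sketch, with the K range and covariance_type as placeholders) is to fit a Gaussian mixture for each K and compare its AIC/BIC:

from sklearn.mixture import GaussianMixture

scores = {}
for k in range(2, 21):  # placeholder range of K
    gm = GaussianMixture(n_components=k, covariance_type='diag', random_state=0)
    gm.fit(feature_matrix)
    scores[k] = (gm.aic(feature_matrix), gm.bic(feature_matrix))

best_k = min(scores, key=lambda k: scores[k][1])  # K with the lowest BIC
print(best_k)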
You may be able to get a feel for a roughly appropriate K by running alternative methods that give you back an estimate of the number of clusters, like DBSCAN. I haven't seen this approach used to estimate K, though, and it is probably inadvisable to rely on it. However, if DBSCAN also gave you a small number of clusters here, then there's likely something about your data that you're not appreciating (i.e. there aren't as many clusters as you're expecting).
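A sketch of that sanity check (eps and min_samples are placeholders that need tuning, e.g. from a k-distance plot, and DBSCAN gets expensive at ~1M points, so it runs on a subsample here):

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
idx = rng.choice(len(feature_matrix), size=50000, replace=False)  # placeholder subsample size
db = DBSCAN(eps=0.5, min_samples=10).fit(feature_matrix[idx])

# label -1 marks noise points, so exclude it when counting clusters
n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
print(n_clusters)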
How much to sample
It looks like you've answered this from your plot: no matter what your sample size, you get the same pattern in silhouette score, so that pattern seems very robust to the sampling assumptions.