I am trying to cluster patches of images with scikit-learn's MiniBatchKMeans to reproduce the results of this paper.
Can I get some guidance on how to set the parameters for MiniBatchKMeans? Currently, the inertia starts to converge, but then it suddenly rises again and the algorithm terminates:
Minibatch iteration 48/1300: mean batch inertia: 22.392906, ewa inertia: 22.500929
Minibatch iteration 49/1300: mean batch inertia: 22.552454, ewa inertia: 22.509173
Minibatch iteration 50/1300: mean batch inertia: 22.582834, ewa inertia: 22.520959
Minibatch iteration 51/1300: mean batch inertia: 22.448639, ewa inertia: 22.509388
Minibatch iteration 52/1300: mean batch inertia: 22.576970, ewa inertia: 22.520201
Minibatch iteration 53/1300: mean batch inertia: 22.489388, ewa inertia: 22.515271
Minibatch iteration 54/1300: mean batch inertia: 22.465019, ewa inertia: 22.507231
Minibatch iteration 55/1300: mean batch inertia: 22.434557, ewa inertia: 22.495603
[MiniBatchKMeans] Reassigning 766 cluster centers.
Minibatch iteration 56/1300: mean batch inertia: 22.513578, ewa inertia: 22.498479
[MiniBatchKMeans] Reassigning 767 cluster centers.
Minibatch iteration 57/1300: mean batch inertia: 26.445686, ewa inertia: 23.130030
Minibatch iteration 58/1300: mean batch inertia: 26.419483, ewa inertia: 23.656341
Minibatch iteration 59/1300: mean batch inertia: 26.599368, ewa inertia: 24.127225
Minibatch iteration 60/1300: mean batch inertia: 26.479168, ewa inertia: 24.503535
Minibatch iteration 61/1300: mean batch inertia: 26.249822, ewa inertia: 24.782940
Minibatch iteration 62/1300: mean batch inertia: 26.456175, ewa inertia: 25.050657
Minibatch iteration 63/1300: mean batch inertia: 26.320527, ewa inertia: 25.253836
Minibatch iteration 64/1300: mean batch inertia: 26.336147, ewa inertia: 25.427005
The image patches I produce don't look like those the authors of the paper get. Can I have some guidance on how to set the parameters of MiniBatchKMeans for better results? Here are my current parameters:
from sklearn.cluster import MiniBatchKMeans

kmeans = MiniBatchKMeans(n_clusters=self.num_centroids,
                         batch_size=self.num_centroids * 20,
                         compute_labels=False,
                         verbose=True)
Mini-batch K-means [11] has been proposed as an alternative to the K-means algorithm for clustering massive datasets. Its advantage is a lower computational cost: each iteration uses only a fixed-size subsample rather than the entire dataset.
The main idea of the mini-batch K-means algorithm is to use small random batches of data of a fixed size, so they can be stored in memory. In each iteration a new random sample is drawn from the dataset and used to update the cluster centers, and this is repeated until convergence.
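To make that update rule concrete, here is a minimal NumPy sketch of mini-batch K-means. The function name and defaults are mine for illustration, and it follows the per-center decaying learning rate of Sculley's formulation as I understand it, not scikit-learn's exact implementation:

import numpy as np

def minibatch_kmeans(X, n_clusters, batch_size=100, n_iter=100, seed=0):
    # Illustrative sketch only; scikit-learn's version adds early
    # stopping, sparse input support, and center reassignment.
    rng = np.random.default_rng(seed)
    # Initialize centers from randomly chosen data points.
    centers = X[rng.choice(len(X), size=n_clusters, replace=False)].astype(float)
    counts = np.zeros(n_clusters)
    for _ in range(n_iter):
        # Draw a fixed-size random batch that fits in memory.
        batch = X[rng.choice(len(X), size=batch_size, replace=False)]
        # Assign each batch point to its nearest center.
        dists = ((batch[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        # Nudge each winning center toward its point, with a per-center
        # learning rate that decays as 1 / (points assigned so far).
        for x, j in zip(batch, labels):
            counts[j] += 1
            eta = 1.0 / counts[j]
            centers[j] = (1.0 - eta) * centers[j] + eta * x
    return centers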
The behaviour you are seeing is controlled by the reassignment_ratio parameter. MiniBatchKMeans tries to avoid creating overly unbalanced clusters: whenever the ratio of the size of the smallest cluster to that of the largest drops below this threshold, the centers of the clusters below the threshold are randomly reinitialized. This is what is indicated by
[MiniBatchKMeans] Reassigning 766 cluster centers.
The larger the number of clusters, the bigger the expected spread in cluster sizes (and thus the smaller the smallest/largest ratio), even in a good clustering. The default setting is reassignment_ratio=0.01, which is too large for 1600 clusters. For more than 1000 clusters, I usually just use reassignment_ratio=0; I have yet to see an improvement from a reassignment in such situations.
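For concreteness, a sketch of that setting, assuming the ~1600 clusters and the 20x batch-size choice from your snippet (adjust to your actual values):

from sklearn.cluster import MiniBatchKMeans

# Disable random reassignment entirely for a large number of clusters.
kmeans = MiniBatchKMeans(n_clusters=1600,
                         batch_size=1600 * 20,
                         reassignment_ratio=0,
                         compute_labels=False,
                         verbose=True)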
If you want to experiment with reassignment, see if something like reassignment_ratio=10**-4 works better than just 0. Keep an eye on the log messages: if more than one or two clusters are getting reassigned at once, you should probably reduce reassignment_ratio further. You may also want to increase max_no_improvement to make sure the algorithm has enough time to recover from the randomization introduced by reassignment, since that is likely to make things worse at least initially, even if it gets you out of a local minimum in the long run.
Increasing the batch size may also help prevent reassignment from being triggered by clusters becoming too small purely from sampling variation.
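Putting those knobs together, an illustrative starting point for such an experiment (the specific values here are untested on your data, so treat them as a baseline to tune rather than a recommendation):

from sklearn.cluster import MiniBatchKMeans

# Illustrative tuning run: tiny reassignment_ratio, more patience before
# early stopping, and a larger batch to reduce sampling noise.
kmeans = MiniBatchKMeans(n_clusters=1600,
                         batch_size=1600 * 50,       # larger than the 20x above
                         reassignment_ratio=10**-4,
                         max_no_improvement=100,     # scikit-learn's default is 10
                         compute_labels=False,
                         verbose=True)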