MiniBatchKMeans Parameters

I am trying to cluster patches of images with Sklearn's Minibatch K-Means to reproduce the results of this paper. Here is some information on my dataset:

  • 400,000 rows
  • 108 dimensions
  • 1600 clusters.

Can I get some guidance on how to set the parameters for Minibatch K-Means? Currently the inertia starts to converge, but then it suddenly rises again and the algorithm terminates:

Minibatch iteration 48/1300:mean batch inertia: 22.392906, ewa inertia: 22.500929 
Minibatch iteration 49/1300:mean batch inertia: 22.552454, ewa inertia: 22.509173 
Minibatch iteration 50/1300:mean batch inertia: 22.582834, ewa inertia: 22.520959 
Minibatch iteration 51/1300:mean batch inertia: 22.448639, ewa inertia: 22.509388 
Minibatch iteration 52/1300:mean batch inertia: 22.576970, ewa inertia: 22.520201 
Minibatch iteration 53/1300:mean batch inertia: 22.489388, ewa inertia: 22.515271 
Minibatch iteration 54/1300:mean batch inertia: 22.465019, ewa inertia: 22.507231 
Minibatch iteration 55/1300:mean batch inertia: 22.434557, ewa inertia: 22.495603 
[MiniBatchKMeans] Reassigning 766 cluster centers.
Minibatch iteration 56/1300:mean batch inertia: 22.513578, ewa inertia: 22.498479 
[MiniBatchKMeans] Reassigning 767 cluster centers.
Minibatch iteration 57/1300:mean batch inertia: 26.445686, ewa inertia: 23.130030 
Minibatch iteration 58/1300:mean batch inertia: 26.419483, ewa inertia: 23.656341 
Minibatch iteration 59/1300:mean batch inertia: 26.599368, ewa inertia: 24.127225 
Minibatch iteration 60/1300:mean batch inertia: 26.479168, ewa inertia: 24.503535 
Minibatch iteration 61/1300:mean batch inertia: 26.249822, ewa inertia: 24.782940 
Minibatch iteration 62/1300:mean batch inertia: 26.456175, ewa inertia: 25.050657 
Minibatch iteration 63/1300:mean batch inertia: 26.320527, ewa inertia: 25.253836 
Minibatch iteration 64/1300:mean batch inertia: 26.336147, ewa inertia: 25.427005 

The image patches I produce don't look like what the authors of the paper get. Can I have some guidance on how to set the parameters for MiniBatchKmeans for better results? Here are my current parameters:

kmeans = MiniBatchKMeans(n_clusters=self.num_centroids, verbose=True, batch_size=self.num_centroids * 20, compute_labels=False)
asked Jan 30 '14 by mchangun



1 Answer

The behaviour you are seeing is controlled by the reassignment_ratio parameter. MiniBatchKMeans tries to avoid creating overly unbalanced clusters: whenever the ratio of the smallest cluster's size to the largest's drops below this value, the centers of the clusters below the threshold are randomly reinitialized. This is what is indicated by

[MiniBatchKMeans] Reassigning 766 cluster centers.

The larger the number of clusters, the bigger the expected spread in cluster sizes (and thus the smaller the smallest/largest ratio), even in a good clustering. The default setting is reassignment_ratio=0.01, which is too large for 1600 clusters. For cluster counts over 1000, I usually just use reassignment_ratio=0; I have yet to see an improvement from a reassignment in such situations.

If you want to experiment with reassignment, see if something like reassignment_ratio=10**-4 works better than just 0. Keep an eye on the log messages: if more than 1 or 2 clusters are getting reassigned at once, you should probably reduce reassignment_ratio further. You may also want to increase max_no_improvement to make sure the algorithm has enough time to recover from the randomization introduced by reassignment, since reassignment is likely to make things worse at least initially, even if it gets you out of a local minimum in the long run. Increasing the batch size may also help avoid reassignment being triggered by clusters becoming too small just from sampling variation.
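Putting that together, here is a minimal sketch of the suggested settings. The data and cluster count are toy stand-ins for the question's 400,000 × 108 matrix and 1600 clusters (scaled down so it runs quickly); the batch_size heuristic of 20 points per cluster comes from the question's own code, and values like max_no_improvement=100 are illustrative, not prescriptions:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Toy stand-in for the 400,000 x 108 patch matrix (shrunk for speed).
rng = np.random.RandomState(0)
X = rng.rand(5000, 108)

n_clusters = 64  # the question uses 1600; scaled down to match the toy data

kmeans = MiniBatchKMeans(
    n_clusters=n_clusters,
    batch_size=n_clusters * 20,  # the question's heuristic: 20 points per cluster per batch
    reassignment_ratio=0,        # disable center reassignment, as suggested for large n_clusters
    max_no_improvement=100,      # give the EWA inertia more time to settle (default is 10)
    compute_labels=False,
    n_init=3,
    random_state=0,
)
kmeans.fit(X)
print(kmeans.cluster_centers_.shape)  # (64, 108)
```

With reassignment_ratio=0 the "[MiniBatchKMeans] Reassigning ... cluster centers" messages (and the inertia jump that follows them) should disappear from the verbose log.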

answered Oct 23 '22 by Daniel Mahler