I am trying to cluster patches of images with scikit-learn's MiniBatchKMeans to reproduce the results of this paper.
Can I get some guidance on how to set the parameters for MiniBatchKMeans? Currently, the inertia starts to converge, but then it suddenly rises again and the algorithm terminates:
Minibatch iteration 48/1300: mean batch inertia: 22.392906, ewa inertia: 22.500929
Minibatch iteration 49/1300: mean batch inertia: 22.552454, ewa inertia: 22.509173
Minibatch iteration 50/1300: mean batch inertia: 22.582834, ewa inertia: 22.520959
Minibatch iteration 51/1300: mean batch inertia: 22.448639, ewa inertia: 22.509388
Minibatch iteration 52/1300: mean batch inertia: 22.576970, ewa inertia: 22.520201
Minibatch iteration 53/1300: mean batch inertia: 22.489388, ewa inertia: 22.515271
Minibatch iteration 54/1300: mean batch inertia: 22.465019, ewa inertia: 22.507231
Minibatch iteration 55/1300: mean batch inertia: 22.434557, ewa inertia: 22.495603
[MiniBatchKMeans] Reassigning 766 cluster centers.
Minibatch iteration 56/1300: mean batch inertia: 22.513578, ewa inertia: 22.498479
[MiniBatchKMeans] Reassigning 767 cluster centers.
Minibatch iteration 57/1300: mean batch inertia: 26.445686, ewa inertia: 23.130030
Minibatch iteration 58/1300: mean batch inertia: 26.419483, ewa inertia: 23.656341
Minibatch iteration 59/1300: mean batch inertia: 26.599368, ewa inertia: 24.127225
Minibatch iteration 60/1300: mean batch inertia: 26.479168, ewa inertia: 24.503535
Minibatch iteration 61/1300: mean batch inertia: 26.249822, ewa inertia: 24.782940
Minibatch iteration 62/1300: mean batch inertia: 26.456175, ewa inertia: 25.050657
Minibatch iteration 63/1300: mean batch inertia: 26.320527, ewa inertia: 25.253836
Minibatch iteration 64/1300: mean batch inertia: 26.336147, ewa inertia: 25.427005
The image patches I produce don't look like those the authors of the paper get. Can I have some guidance on how to set the parameters of MiniBatchKMeans for better results? Here are my current parameters:
from sklearn.cluster import MiniBatchKMeans

kmeans = MiniBatchKMeans(n_clusters=self.num_centroids,
                         batch_size=self.num_centroids * 20,
                         compute_labels=False,
                         verbose=True)
Mini-batch K-means [11] has been proposed as an alternative to the K-means algorithm for clustering massive datasets. Its advantage is a lower computational cost: each iteration uses only a fixed-size subsample rather than the entire dataset.
The main idea of the mini-batch K-means algorithm is to use small random batches of data of a fixed size, so they can be stored in memory. In each iteration a new random sample is drawn from the dataset and used to update the cluster centers, and this is repeated until convergence.
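To make that update rule concrete, here is a minimal NumPy sketch of mini-batch K-means. The function name and defaults are mine for illustration, and it follows the per-center decaying learning rate of Sculley's formulation as I understand it, not scikit-learn's exact implementation:

import numpy as np

def minibatch_kmeans(X, n_clusters, batch_size=100, n_iter=100, seed=0):
    # Illustrative sketch only; scikit-learn's version adds early
    # stopping, sparse input support, and center reassignment.
    rng = np.random.default_rng(seed)
    # Initialize centers from randomly chosen data points.
    centers = X[rng.choice(len(X), size=n_clusters, replace=False)].astype(float)
    counts = np.zeros(n_clusters)
    for _ in range(n_iter):
        # Draw a fixed-size random batch that fits in memory.
        batch = X[rng.choice(len(X), size=batch_size, replace=False)]
        # Assign each batch point to its nearest center.
        dists = ((batch[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        # Nudge each winning center toward its point, with a per-center
        # learning rate that decays as 1 / (points assigned so far).
        for x, j in zip(batch, labels):
            counts[j] += 1
            eta = 1.0 / counts[j]
            centers[j] = (1.0 - eta) * centers[j] + eta * x
    return centers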
The behaviour you are seeing is controlled by the reassignment_ratio parameter. MiniBatchKMeans tries to avoid creating overly unbalanced clusters: whenever the ratio of the size of the smallest cluster to that of the largest drops below this threshold, the centers of the clusters below the threshold are randomly reinitialized. This is what is indicated by
[MiniBatchKMeans] Reassigning 766 cluster centers.
The larger the number of clusters, the bigger the expected spread in cluster sizes (and thus the smaller the smallest/largest ratio), even in a good clustering. The default setting is reassignment_ratio=0.01, which is too large for 1600 clusters. For more than 1000 clusters, I usually just use reassignment_ratio=0; I have yet to see an improvement from a reassignment in such situations.
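For concreteness, a sketch of that setting, assuming the ~1600 clusters and the 20x batch-size choice from your snippet (adjust to your actual values):

from sklearn.cluster import MiniBatchKMeans

# Disable random reassignment entirely for a large number of clusters.
kmeans = MiniBatchKMeans(n_clusters=1600,
                         batch_size=1600 * 20,
                         reassignment_ratio=0,
                         compute_labels=False,
                         verbose=True)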
If you want to experiment with reassignment, see if something like reassignment_ratio=10**-4 works better than just 0. Keep an eye on the log messages: if more than one or two clusters are getting reassigned at once, you should probably reduce reassignment_ratio further. You may also want to increase max_no_improvement to make sure the algorithm has enough time to recover from the randomization introduced by reassignment, since that is likely to make things worse at least initially, even if it gets you out of a local minimum in the long run.
Increasing the batch size may also help prevent reassignment from being triggered by clusters becoming too small purely from sampling variation.
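Putting those knobs together, an illustrative starting point for such an experiment (the specific values here are untested on your data, so treat them as a baseline to tune rather than a recommendation):

from sklearn.cluster import MiniBatchKMeans

# Illustrative tuning run: tiny reassignment_ratio, more patience before
# early stopping, and a larger batch to reduce sampling noise.
kmeans = MiniBatchKMeans(n_clusters=1600,
                         batch_size=1600 * 50,       # larger than the 20x above
                         reassignment_ratio=10**-4,
                         max_no_improvement=100,     # scikit-learn's default is 10
                         compute_labels=False,
                         verbose=True)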