I am trying to cluster 1000 dimension, 250k vectors using k-means. The machine that I am working on has 80 dual-cores.
Just confirming, if anyone has compared the run-time of k-means default batch parallel version against k-means mini-batch version? The example comparison page on sklean documents doesn't provide much info as the dataset is quite small.
Much appreciate your help.
Regards,
Conventional wisdom holds that Mini-Batch K-Means
should be faster and more efficient for greater than 10,000 samples. Since you have 250,000 samples, you should probably use mini-batch if you don't want to test it out on your own.
Note that the example you referenced can very easily be changed to a 5000, 10,000 or 20,000 point example by changing n_samples in this line:
X, labels_true = make_blobs(n_samples=3000, centers=centers, cluster_std=0.7)
I agree that this won't necessarily scale the same for 1000 dimensional vectors, but since you are constructing the example and are using either k-means
or mini batch k-means
and it only takes a second to switch between them... You should just do a scaling study for your 1000 dimensional vectors for 5k, 10k, 15k, 20k samples.
Theoretically, there is no reason why Mini-Batch K-Means
should underperform K-Means
due to vector dimensionality and we know that it does better for larger sample sizes, so I would go with mini batch off the cuff e.g. bias for action over research.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With