
Comparing parallel k-means batch vs mini-batch speed

Tags: scikit-learn

I am trying to cluster 250k vectors of 1000 dimensions each using k-means. The machine that I am working on has 80 dual-cores.

Has anyone compared the run time of the default parallel batch k-means against the mini-batch k-means version? The example comparison page in the scikit-learn documentation doesn't provide much information, as the dataset is quite small.

I would much appreciate your help.

Regards,

PS1 asked Jan 16 '15 15:01



1 Answer

Conventional wisdom holds that Mini-Batch K-Means should be faster and more efficient for more than 10,000 samples. Since you have 250,000 samples, you should probably use mini-batch if you don't want to test it yourself.

Note that the example you referenced can very easily be changed to a 5,000-, 10,000-, or 20,000-point example by changing n_samples in this line:

X, labels_true = make_blobs(n_samples=3000, centers=centers, cluster_std=0.7)

I agree that this won't necessarily scale the same way for 1000-dimensional vectors. But since you are constructing the example yourself, and switching between k-means and mini-batch k-means takes only a second, you should just do a scaling study on your 1000-dimensional vectors with 5k, 10k, 15k, and 20k samples.
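A scaling study like that could be sketched roughly as follows. This is only a minimal sketch on synthetic blobs, not a benchmark of your data: the helper name `scaling_study` is made up for illustration, and `n_clusters=10`, `batch_size=1024`, and `n_init=3` are assumed values that you would replace with your own. The timings are plain wall-clock times, so they include whatever thread-level parallelism your installed scikit-learn uses.

```python
import time

from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs


def scaling_study(sample_sizes, n_features=1000, n_clusters=10,
                  batch_size=1024, seed=0):
    """Time KMeans and MiniBatchKMeans on synthetic blobs of growing size.

    Hypothetical helper: the defaults are assumptions, not tuned values.
    Returns one dict per sample size with fit times and final inertia.
    """
    results = []
    for n_samples in sample_sizes:
        # Synthetic data with the requested dimensionality.
        X, _ = make_blobs(n_samples=n_samples, n_features=n_features,
                          centers=n_clusters, cluster_std=0.7,
                          random_state=seed)
        models = [
            ("kmeans", KMeans(n_clusters=n_clusters, n_init=3,
                              random_state=seed)),
            ("minibatch", MiniBatchKMeans(n_clusters=n_clusters,
                                          batch_size=batch_size, n_init=3,
                                          random_state=seed)),
        ]
        row = {"n_samples": n_samples}
        for name, model in models:
            start = time.perf_counter()
            model.fit(X)
            row[name + "_seconds"] = time.perf_counter() - start
            # Inertia lets you check how much quality mini-batch gives up.
            row[name + "_inertia"] = model.inertia_
        results.append(row)
    return results
```

Calling `scaling_study([5000, 10000, 15000, 20000])` would cover the sample sizes suggested above. Plotting the two `_seconds` columns against `n_samples` should show how each variant scales before you commit to the full 250k run.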

Theoretically, there is no reason why Mini-Batch K-Means should underperform K-Means because of vector dimensionality, and we know that it does better for larger sample sizes. So, off the cuff, I would go with mini-batch, i.e. bias for action over research.

AN6U5 answered Oct 18 '22 06:10
