How long should it take to cluster a set of 100,000 L2-normalized 2048-dimensional feature vectors using k-means with 200 clusters? I have all my data in a huge numpy array; maybe there's a more appropriate data structure?
It didn't seem to make any progress in an hour. I'm also inclined to use a threshold-based stopping criterion, but it seems to take more than 5 minutes for just 2 iterations. Is there some sort of verbose option I can use to check on the progress of k-means clustering in scikit-learn? Does anyone suggest another approach? Maybe something like dimensionality reduction, e.g. PCA, and then k-means? (I'm just throwing out random ideas here.)
If you haven't tried it yet, use sklearn.cluster.MiniBatchKMeans instead of sklearn.cluster.KMeans. E.g., if X.shape == (100000, 2048), then write:
from sklearn.cluster import MiniBatchKMeans
mbkm = MiniBatchKMeans(n_clusters=200) # Take a good look at the docstring and set options here
mbkm.fit(X)
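Regarding the progress-monitoring question above: both KMeans and MiniBatchKMeans accept a verbose parameter, so a sketch of the same call with progress printing enabled (the batch_size value here is an illustrative assumption, not part of the original answer) would be:

mbkm = MiniBatchKMeans(n_clusters=200, batch_size=1024, verbose=1)  # verbose=1 prints per-iteration progress
mbkm.fit(X)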
MiniBatchKMeans finds slightly different clusters from normal KMeans, but it has the huge advantage that it is an online algorithm which doesn't need all the data at every iteration and still gives useful results.
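Because it is an online estimator, MiniBatchKMeans also exposes partial_fit, so you can stream the data in chunks instead of handing it the whole array at once. A minimal sketch, assuming X is the (100000, 2048) array from above and using an arbitrary chunk count:

import numpy as np
from sklearn.cluster import MiniBatchKMeans

mbkm = MiniBatchKMeans(n_clusters=200)
for chunk in np.array_split(X, 100):  # feed the data as 100 mini-batches
    mbkm.partial_fit(chunk)           # update the centroids incrementally
labels = mbkm.predict(X)              # assign every vector to its final cluster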