I am interested in sklearn.cluster.MiniBatchKMeans as a way to handle huge datasets. Anyway, I am a bit confused about the difference between MiniBatchKMeans.partial_fit() and MiniBatchKMeans.fit().
The documentation for fit() states: "Compute the centroids on X by chunking it into mini-batches." while the documentation for partial_fit() states: "Update k means estimate on a single mini-batch X."
So, as I understand it, fit() splits up the dataset into chunks of data with which it trains the k-means (I guess the batch_size argument of MiniBatchKMeans() refers to this), while partial_fit() uses all the data passed to it to update the centres. The term "update" may seem a bit ambiguous as to whether an initial training (using fit()) should have been performed first, but judging from the example in the documentation this is not necessary (I can use partial_fit() from the beginning as well).
Is it true that partial_fit() will use all data passed to it regardless of its size, or is the data size bound by the batch_size passed as an argument to the MiniBatchKMeans constructor? Also, if batch_size is set greater than the actual data size, is the result the same as the standard k-means algorithm? (I guess efficiency could vary in the latter case, though, due to different architectures.)
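To make my understanding concrete, here is a minimal sketch of how I am using the two entry points (the synthetic data and the chunk count are just placeholders I picked for illustration):

    import numpy as np
    from sklearn.cluster import MiniBatchKMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=10_000, centers=5, random_state=0)

    # Offline: fit() receives the whole dataset and internally draws
    # mini-batches of size batch_size from it.
    mbk_offline = MiniBatchKMeans(n_clusters=5, batch_size=1024, random_state=0)
    mbk_offline.fit(X)

    # Online: each partial_fit() call performs a single update using
    # the rows passed in, so the chunking is up to me.
    mbk_online = MiniBatchKMeans(n_clusters=5, random_state=0)
    for chunk in np.array_split(X, 10):
        mbk_online.partial_fit(chunk)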
Mini Batch K-means ([11]) has been proposed as an alternative to the K-means algorithm for clustering massive datasets. The advantage of this algorithm is that it reduces the computational cost by not using the whole dataset at each iteration, but only a fixed-size subsample. Mini-batch K-means is faster, but gives slightly different results than normal batch K-means. Here we cluster a set of data, first with K-means and then with mini-batch K-means, and plot the results. We will also plot the points that are labeled differently between the two algorithms.
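A condensed sketch of that comparison (plotting omitted; the synthetic data, cluster count, and centre-matching step below are stand-ins for the example's exact setup):

    from sklearn.cluster import KMeans, MiniBatchKMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import pairwise_distances_argmin

    X, _ = make_blobs(n_samples=30_000, centers=3, random_state=0)

    km = KMeans(n_clusters=3, random_state=0).fit(X)
    mbk = MiniBatchKMeans(n_clusters=3, batch_size=1024, random_state=0).fit(X)

    # Match each K-means centre to its nearest mini-batch centre so the
    # label indices are comparable, then count disagreements.
    order = pairwise_distances_argmin(km.cluster_centers_, mbk.cluster_centers_)
    different = mbk.labels_ != order[km.labels_]
    print(f"{different.mean():.2%} of points labeled differently")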
batch_size : int, default=1024. Size of the mini batches. For faster computations, you can set batch_size greater than 256 * number of cores to enable parallelism on all cores.
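A small sketch following that sizing hint (assuming joblib, which ships with scikit-learn, for the core count):

    from joblib import cpu_count
    from sklearn.cluster import MiniBatchKMeans

    # Per the docstring above, a batch larger than 256 * n_cores
    # lets the computation use all cores.
    mbk = MiniBatchKMeans(n_clusters=8, batch_size=256 * cpu_count() + 1)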
partial_fit is for online clustering, whereas fit is for offline clustering; however, I think MiniBatchKMeans's partial_fit method is a little rough.

I dug through old PRs in the repo and found this one; it seems to be the first commit of this implementation, and it mentions that this algorithm can implement the partial_fit method as an online clustering method (following the online API discussion).

So, like the BIRCH implementation, this algorithm uses fit for one-time offline clustering and partial_fit for online clustering.
However, I did some tests comparing the ARI of the resulting labels when using fit on the entire dataset versus using partial_fit and fit on chunks, and I didn't seem to get anywhere, since the ARI results were very low (~0.5); moreover, after changing the initialization, the chunked fit apparently beat partial_fit, which doesn't make sense. You can find my notebook here.
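For reference, here is the shape of the comparison I ran (simplified; the synthetic data and chunk count stand in for the notebook's actual setup):

    import numpy as np
    from sklearn.cluster import MiniBatchKMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import adjusted_rand_score

    X, _ = make_blobs(n_samples=20_000, centers=8, random_state=42)

    # fit on the entire dataset at once.
    full = MiniBatchKMeans(n_clusters=8, random_state=42).fit(X)

    # partial_fit on the same data, one chunk at a time.
    chunked = MiniBatchKMeans(n_clusters=8, random_state=42)
    for chunk in np.array_split(X, 20):
        chunked.partial_fit(chunk)

    # ARI handles label permutations, so no matching step is needed.
    print(adjusted_rand_score(full.predict(X), chunked.predict(X)))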
So, based on this response in the PR:
I believe that this branch can and should be merged.
The online fitting API (partial_fit) probably needs to mature, but I think that it is a bit out of scope of this work. The core contribution, that is a mini-batch K-means, is nice, and does seem to speed up things.
my guess is that the implementation hasn't changed much since that PR, and the partial_fit method is still a little rough. The code has changed between 2011 and now (comparing against the release tag), but both versions call the function _mini_batch_step once in partial_fit (without verbose info) and multiple times in fit (with verbose info).
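You can observe that one-step-per-call behaviour from the outside; a quick check, assuming a recent scikit-learn version that exposes the n_steps_ attribute (which counts processed mini-batches):

    from sklearn.cluster import MiniBatchKMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=5_000, centers=3, random_state=0)

    offline = MiniBatchKMeans(n_clusters=3, random_state=0).fit(X)
    print(offline.n_steps_)   # many internal mini-batch steps

    online = MiniBatchKMeans(n_clusters=3, random_state=0)
    online.partial_fit(X)
    print(online.n_steps_)    # 1: a single _mini_batch_step per call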