Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Differences between MiniBatchKMeans.fit and MiniBatchKMeans.partial_fit

I am interested in sklearn.cluster.MiniBatchKMeans as a way to use huge datasets. Anyway I am a bit confused about the difference between MiniBatchKMeans.partial_fit() and MiniBatchKMeans.fit().

Documentation about fit() states that:
Compute the centroids on X by chunking it into mini-batches.

while documentation about partial_fit() states that:
Update k means estimate on a single mini-batch X.

So, as I understand it fit() splits up the dataset to chunk of data with which it trains the k means (I guess the argument batch_size of MiniBatchKMeans() refers to this one) while partial_fit() uses all data passed to it to update the centres. The term "update" may seem a bit ambiguous indicating an initial training (using fit()) should have been performed or not, but judging from the example in the documentation this is not necessary (I can use partial_fit() at the beginning also).

Is it true that partial_fit() will use all data passed to it regardless of size or is the data size bound to the batch_size passed as argument to the MiniBatchKMeans constructor? Also if batch_size is set to be greater than the actual data size is the result the same as the standard k-means algorithm (I guess efficient could vary in the latter case though due to different architectures).

like image 329
Eypros Avatar asked Oct 31 '18 20:10

Eypros


People also ask

What is difference between K-Means and mini batch K-Means?

Mini Batch K-means ([11]) has been proposed as an alternative to the K-means algorithm for clustering massive datasets. The advantage of this algorithm is to reduce the computational cost by not using all the dataset each iteration but a subsample of a fixed size.

What is difference between K mean clustering and mini batch K mean clustering?

The mini batch K-means is faster but gives slightly different results than the normal batch K-means. Here we cluster a set of data, first with K-means and then with mini batch K-means, and plot the results. We will also plot the points that are labeled differently between the two algorithms.

What does 0.0k mean?

0.0, “K” 0.0,, “M” Zero is used to display insignificant zeros when the number has fewer digits than the format represented using zero. For example, a custom format 0.00 will display number 5, 8.5, and 10.99 as 5.00, 8.50, and 10.99 respectively.

What is batch size in K-Means?

batch_sizeint, default=1024. Size of the mini batches. For faster computations, you can set the batch_size greater than 256 * number of cores to enable parallelism on all cores.


1 Answers

TL;DR

partial_fit is for online clustering were fit is for offline, however i think MiniBatchKMeans's partial_fit method is a little rough.

Long explanation

I diged old PR's from the repo, and found this one, it seems to be the first commit of this implementation, it mentions that this algorithm can implement the partial_fit method as a online clustering method (following the online API discussion).

So as well as the BIRCH implementation, this algorithm uses fit as one time offline clustering and partial_fit as online clustering.

However, i did some tests comparing the ARI of the result labels by using the fit in the entire dataset versus using partial_fit and fit in chunks, and didn't seems to get anywhere, since the ARI result were very low (~0.5), and by changing the initialization apparently the fit chunked beat partial_fit, which doesn't make sense. you can find my notebook here.

So my guess is, based in this response in the PR:

I believe that this branch can and should be merged.

The online fitting API (partial_fit) probably needs to mature, but I think that it is a bit out of scope of this work. The core contribution, that is a mini-batch K-means, is nice, and does seem to speed up things.

Is that the implementation hasn't changed much since that PR, and the partial_fit method is still a little rough, the two implementations from 2011 and now has changed (compared from the release tag), however both of them calls the function _mini_batch_step once in partial_fit (without verbose info) and calls multiple time in fit (with verbose info).

like image 116
giuliano-oliveira Avatar answered Oct 18 '22 18:10

giuliano-oliveira