I am interested in sklearn.cluster.MiniBatchKMeans as a way to handle huge datasets. Anyway, I am a bit confused about the difference between MiniBatchKMeans.partial_fit() and MiniBatchKMeans.fit().
The documentation for fit() states: "Compute the centroids on X by chunking it into mini-batches." while the documentation for partial_fit() states: "Update k means estimate on a single mini-batch X."
So, as I understand it, fit() splits up the dataset into chunks of data with which it trains the k-means (I guess the batch_size argument of MiniBatchKMeans() refers to this), while partial_fit() uses all the data passed to it to update the centres. The term "update" may seem a bit ambiguous as to whether an initial training (using fit()) should have been performed first, but judging from the example in the documentation this is not necessary (I can use partial_fit() from the beginning as well).
Is it true that partial_fit() will use all data passed to it regardless of its size, or is the data size bound by the batch_size passed as an argument to the MiniBatchKMeans constructor? Also, if batch_size is set greater than the actual data size, is the result the same as the standard k-means algorithm? (I guess efficiency could vary in the latter case, though, due to different architectures.)
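To make my understanding concrete, here is a minimal sketch of how I am using the two entry points (the synthetic data and the chunk count are just placeholders I picked for illustration):

    import numpy as np
    from sklearn.cluster import MiniBatchKMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=10_000, centers=5, random_state=0)

    # Offline: fit() receives the whole dataset and internally draws
    # mini-batches of size batch_size from it.
    mbk_offline = MiniBatchKMeans(n_clusters=5, batch_size=1024, random_state=0)
    mbk_offline.fit(X)

    # Online: each partial_fit() call performs a single update using
    # the rows passed in, so the chunking is up to me.
    mbk_online = MiniBatchKMeans(n_clusters=5, random_state=0)
    for chunk in np.array_split(X, 10):
        mbk_online.partial_fit(chunk)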
Mini Batch K-means ([11]) has been proposed as an alternative to the K-means algorithm for clustering massive datasets. The advantage of this algorithm is that it reduces the computational cost by not using the whole dataset at each iteration, but only a fixed-size subsample. Mini-batch K-means is faster, but gives slightly different results than normal batch K-means. Here we cluster a set of data, first with K-means and then with mini-batch K-means, and plot the results. We will also plot the points that are labeled differently between the two algorithms.
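A condensed sketch of that comparison (plotting omitted; the synthetic data, cluster count, and centre-matching step below are stand-ins for the example's exact setup):

    from sklearn.cluster import KMeans, MiniBatchKMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import pairwise_distances_argmin

    X, _ = make_blobs(n_samples=30_000, centers=3, random_state=0)

    km = KMeans(n_clusters=3, random_state=0).fit(X)
    mbk = MiniBatchKMeans(n_clusters=3, batch_size=1024, random_state=0).fit(X)

    # Match each K-means centre to its nearest mini-batch centre so the
    # label indices are comparable, then count disagreements.
    order = pairwise_distances_argmin(km.cluster_centers_, mbk.cluster_centers_)
    different = mbk.labels_ != order[km.labels_]
    print(f"{different.mean():.2%} of points labeled differently")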
batch_size : int, default=1024. Size of the mini batches. For faster computations, you can set batch_size greater than 256 * number of cores to enable parallelism on all cores.
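A small sketch following that sizing hint (assuming joblib, which ships with scikit-learn, for the core count):

    from joblib import cpu_count
    from sklearn.cluster import MiniBatchKMeans

    # Per the docstring above, a batch larger than 256 * n_cores
    # lets the computation use all cores.
    mbk = MiniBatchKMeans(n_clusters=8, batch_size=256 * cpu_count() + 1)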
partial_fit is for online clustering, whereas fit is for offline clustering; however, I think MiniBatchKMeans's partial_fit method is a little rough.

I dug through old PRs in the repo and found this one; it seems to be the first commit of this implementation, and it mentions that this algorithm can implement the partial_fit method as an online clustering method (following the online API discussion).

So, like the BIRCH implementation, this algorithm uses fit for one-time offline clustering and partial_fit for online clustering.
However, I did some tests comparing the ARI of the resulting labels when using fit on the entire dataset versus using partial_fit and fit on chunks, and I didn't seem to get anywhere, since the ARI results were very low (~0.5); moreover, after changing the initialization, the chunked fit apparently beat partial_fit, which doesn't make sense. You can find my notebook here.
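For reference, here is the shape of the comparison I ran (simplified; the synthetic data and chunk count stand in for the notebook's actual setup):

    import numpy as np
    from sklearn.cluster import MiniBatchKMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import adjusted_rand_score

    X, _ = make_blobs(n_samples=20_000, centers=8, random_state=42)

    # fit on the entire dataset at once.
    full = MiniBatchKMeans(n_clusters=8, random_state=42).fit(X)

    # partial_fit on the same data, one chunk at a time.
    chunked = MiniBatchKMeans(n_clusters=8, random_state=42)
    for chunk in np.array_split(X, 20):
        chunked.partial_fit(chunk)

    # ARI handles label permutations, so no matching step is needed.
    print(adjusted_rand_score(full.predict(X), chunked.predict(X)))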
So, based on this response in the PR:
I believe that this branch can and should be merged.
The online fitting API (partial_fit) probably needs to mature, but I think that it is a bit out of scope of this work. The core contribution, that is a mini-batch K-means, is nice, and does seem to speed up things.
my guess is that the implementation hasn't changed much since that PR, and the partial_fit method is still a little rough. The code has changed between 2011 and now (comparing against the release tag), but both versions call the function _mini_batch_step once in partial_fit (without verbose info) and multiple times in fit (with verbose info).
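You can observe that one-step-per-call behaviour from the outside; a quick check, assuming a recent scikit-learn version that exposes the n_steps_ attribute (which counts processed mini-batches):

    from sklearn.cluster import MiniBatchKMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=5_000, centers=3, random_state=0)

    offline = MiniBatchKMeans(n_clusters=3, random_state=0).fit(X)
    print(offline.n_steps_)   # many internal mini-batch steps

    online = MiniBatchKMeans(n_clusters=3, random_state=0)
    online.partial_fit(X)
    print(online.n_steps_)    # 1: a single _mini_batch_step per call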