
Why do fit and partial_fit of sklearn's LatentDirichletAllocation return different results?

What is strange is that fit and partial_fit seem to run exactly the same code.

You can see the code at the following link:

https://github.com/scikit-learn/scikit-learn/blob/c957249/sklearn/decomposition/online_lda.py#L478

augustin-barillec Avatar asked Feb 12 '16 14:02




1 Answer

Not exactly the same code; partial_fit uses total_samples:

"total_samples : int, optional (default=1e6). Total number of documents. Only used in the partial_fit method."

https://github.com/scikit-learn/scikit-learn/blob/c957249/sklearn/decomposition/online_lda.py#L184

(partial fit) https://github.com/scikit-learn/scikit-learn/blob/c957249/sklearn/decomposition/online_lda.py#L472

(fit) https://github.com/scikit-learn/scikit-learn/blob/c957249/sklearn/decomposition/online_lda.py#L510

In case it is of interest: partial_fit is a good candidate whenever your dataset is too big to fit in memory. Instead of running into memory problems, you fit the model in smaller batches, which is called incremental learning.
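A minimal sketch of that batch-by-batch pattern, assuming a recent scikit-learn release (where the number of topics is passed as n_components) and a small synthetic count matrix in place of a real corpus. The chunk size of 50 is arbitrary:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Synthetic document-term counts standing in for a large corpus (assumption).
rng = np.random.RandomState(0)
X = rng.randint(0, 5, size=(200, 30))

lda = LatentDirichletAllocation(
    n_components=5,            # called n_topics in 2016-era releases
    total_samples=X.shape[0],  # size of the whole corpus, not of one batch
    random_state=0,
)

# Feed the corpus in chunks of 50 documents; in practice each chunk
# would be loaded from disk instead of sliced from an in-memory array.
for start in range(0, X.shape[0], 50):
    lda.partial_fit(X[start:start + 50])

print(lda.components_.shape)
```

Only one chunk has to be in memory at a time, which is the point of using partial_fit here.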

So, in your case you should take into account that the default value of total_samples is 1e6 (1,000,000). partial_fit scales its updates by that number, while fit uses the actual number of documents it is given, so if you don't set total_samples to your real number of samples you'll get different results from fit and partial_fit. Or it could be that you are using mini-batches in partial_fit and not covering all the samples that you provide to the fit method. And even if you do this right, you could still get different results, as stated in the documentation:

  • "the incremental learner itself may be unable to cope with new/unseen targets classes. In this case you have to pass all the possible classes to the first partial_fit call using the classes= parameter."
  • "[...] choosing a proper algorithm is that all of them don’t put the same importance on each example over time [...]"

sklearn documentation: https://scikit-learn.org/0.15/modules/scaling_strategies.html#incremental-learning
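To make the total_samples point concrete, here is a rough sketch comparing one online pass of fit against partial_fit, once with the default total_samples and once with it set to the real number of documents. It assumes a recent scikit-learn release (n_components, learning_method) and synthetic data; the identical seeds and batch size are my choice so that total_samples is the only thing that differs:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Synthetic document-term counts (assumption, not data from the question).
rng = np.random.RandomState(0)
X = rng.randint(0, 5, size=(200, 30))

# Same settings for every model so only total_samples varies.
common = dict(n_components=5, learning_method="online",
              batch_size=50, random_state=0)

# One online pass with fit: uses the real number of documents internally.
fitted = LatentDirichletAllocation(max_iter=1, **common).fit(X)

# partial_fit with the default total_samples (1e6).
default_total = LatentDirichletAllocation(**common).partial_fit(X)

# partial_fit with total_samples set to the real corpus size.
matched_total = LatentDirichletAllocation(
    total_samples=X.shape[0], **common).partial_fit(X)

diff_default = np.abs(fitted.components_ - default_total.components_).max()
diff_matched = np.abs(fitted.components_ - matched_total.components_).max()
print("default total_samples:", diff_default)
print("matched total_samples:", diff_matched)
```

With the default, the topic-word matrix should be far from the one fit produces. With total_samples matched to the data, I would expect the two to agree for a single pass, since both methods then run the same online update over the same batches.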

Guiem Bosch Avatar answered Sep 21 '22 13:09
