Sklearn Fit model multiple times

Tags:

python

scikit-learn

The origin of the problem is common: a large amount of training data that is read in chunks. The goal is to fit the desired model sequentially on the chunked data sets, keeping the state from the previous fits.

Are there any methods other than partial_fit() to fit a model with sklearn on different data? Are there any tricks for rewriting the code of the fit() function to customize it for this problem? Or is it somehow possible to do this with pickle?


asked Aug 11 '16 11:08

Marcel Mars

1 Answer

There is a reason why some models expose partial_fit() and others don't. Every model is a different machine learning algorithm and for many of these algorithms there is just no way to add an element without recalculating the model from scratch.

So, if you have to fit the models incrementally, pick an incremental model that has partial_fit(). You can find a full list on this documentation page.
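As a rough illustration of the incremental route, here is a minimal sketch using SGDClassifier, one of the estimators that exposes partial_fit(). The synthetic data and the iter_chunks() helper are assumptions made up for this example; in practice the chunks would come from wherever the data is being read.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier


def iter_chunks(X, y, chunk_size):
    """Hypothetical helper: yield consecutive (X, y) slices,
    standing in for chunks read from disk."""
    for start in range(0, len(X), chunk_size):
        yield X[start:start + chunk_size], y[start:start + chunk_size]


# Synthetic, linearly separable data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

model = SGDClassifier(random_state=0)

# partial_fit() must be told about every possible label on the first call,
# because a single chunk is not guaranteed to contain all of them.
all_classes = np.unique(y)

for X_chunk, y_chunk in iter_chunks(X, y, chunk_size=100):
    # Each call updates the existing weights instead of starting over.
    model.partial_fit(X_chunk, y_chunk, classes=all_classes)

accuracy = model.score(X, y)
```

Note that each partial_fit() call makes a single pass over the chunk it is given, so the result is generally not identical to calling fit() once on the full data set.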

Alternatively, you can build an ensemble model. Create a separate classifier or regressor for every chunk of data you have. Then, when you need to predict something, you can just take a majority vote:

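import numpy
# One row of predicted labels per classifier: shape (n_classifiers, n_samples)
all_predictions = numpy.array([classifier.predict(X) for classifier in classifiers])
# Majority vote per sample (assumes labels are non-negative integers)
prediction = numpy.array([numpy.bincount(column).argmax() for column in all_predictions.T])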

or, for regressors

# axis=0 averages across regressors, giving one prediction per sample
prediction = numpy.mean([regressor.predict(X) for regressor in regressors], axis=0)
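To show the whole ensemble idea end to end, the sketch below trains one model per chunk and then combines them by majority vote. The choice of DecisionTreeClassifier, the synthetic data, and the split into three chunks are all assumptions for illustration; any estimator with fit() and predict() could be used instead.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (illustrative only): the label depends on the first feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 4))
y = (X[:, 0] > 0).astype(int)

# Pretend the data arrives as three separate chunks.
chunks = zip(np.array_split(X, 3), np.array_split(y, 3))

# Fit one independent classifier per chunk; no state is shared between them.
classifiers = [
    DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_chunk, y_chunk)
    for X_chunk, y_chunk in chunks
]

# Shape (n_classifiers, n_samples): one row of predicted labels per model.
all_predictions = np.array([clf.predict(X) for clf in classifiers])

# Majority vote per sample; bincount needs non-negative integer labels.
prediction = np.array([np.bincount(col).argmax() for col in all_predictions.T])

accuracy = np.mean(prediction == y)
```

Each model only ever sees its own chunk, so this works best when every chunk is a reasonably representative sample of the whole data set.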

answered Oct 05 '22 22:10

0x60
