 

Retraining an existing machine learning model with new data

I have an ML model trained on about a million samples (supervised classification on text), and I want the same model to be trained again as soon as new training data comes in.

This process is continuous, and I don't want to lose the model's predictive power each time it receives a new data set. I also don't want to merge the new data with my historical data (~1 million samples) and train from scratch.

Ideally, the model would grow gradually, training on all the data over a period of time while preserving its accumulated knowledge each time it receives a new set of training data. What is the best way to avoid retraining on all the historical data? A code sample would help me.

Uma Sankar asked Mar 15 '26 13:03


2 Answers

You want to look into incremental learning techniques for that. Many scikit-learn estimators offer a partial_fit method, which means you can incrementally train on small batches of data.

A common approach for these cases is to use SGDClassifier (or SGDRegressor), which updates the model's parameters using a fraction of the samples at each iteration, making it a natural candidate for online learning problems. However, you must update the model through partial_fit; calling fit again would retrain the whole model from scratch.

From the documentation

SGD allows minibatch (online/out-of-core) learning, see the partial_fit method

As mentioned, several other estimators in scikit-learn implement the partial_fit API, as listed in the section on incremental learning, including MultinomialNB, linear_model.Perceptron, and MiniBatchKMeans, among others.


Here's a toy example to illustrate how it's used:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.linear_model import SGDClassifier

X, y = load_iris(return_X_y=True)

clf = SGDClassifier()

kf = KFold(n_splits=2)
kf_splits = kf.split(X)

train_index, test_index = next(kf_splits)
# partial_fit with training data. On the first call
# the full set of classes must be provided
clf.partial_fit(X[train_index], y[train_index], classes=np.unique(y))

# continue training on new data, without revisiting the first batch
train_index, test_index = next(kf_splits)
clf.partial_fit(X[train_index], y[train_index])
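Since the question is about text classification, here is a sketch of how the same idea looks for streaming documents. It pairs MultinomialNB (one of the incremental estimators mentioned above) with HashingVectorizer, which is stateless and therefore transforms any future batch consistently without being refit; the documents and labels below are invented for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB

# alternate_sign=False keeps features non-negative, as MultinomialNB requires
vectorizer = HashingVectorizer(n_features=2**16, alternate_sign=False)
clf = MultinomialNB()

classes = np.array(["ham", "spam"])

# first batch: classes must be passed on the first partial_fit call
batch_1 = ["win a free prize now", "meeting at noon tomorrow"]
labels_1 = ["spam", "ham"]
clf.partial_fit(vectorizer.transform(batch_1), labels_1, classes=classes)

# later, a new batch arrives: update the model without touching the old data
batch_2 = ["free cash prize inside", "lunch plans for tomorrow"]
labels_2 = ["spam", "ham"]
clf.partial_fit(vectorizer.transform(batch_2), labels_2)

print(clf.predict(vectorizer.transform(["claim your free prize"])))
```

Each batch is discarded after its partial_fit call, so memory use stays constant no matter how much data has been seen.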
yatu answered Mar 17 '26 02:03


What you are looking for is incremental learning; there is an excellent library called creme which helps you with that.

All the tools in the library can be updated with a single observation at a time, and can therefore be used to learn from streaming data.

Here are some benefits of using creme (and online machine learning in general):

Incremental: models can update themselves in real-time.
Adaptive: models can adapt to concept drift.
Production-ready: working with data streams makes it simple to replicate production scenarios during model development.
Efficient: models don't have to be retrained and require little compute power, which lowers their carbon footprint.
Fast: when the goal is to learn and predict with a single instance at a time, creme is an order of magnitude faster than PyTorch, TensorFlow, and scikit-learn.

Check out this: https://pypi.org/project/creme/

Ashish Dagar answered Mar 17 '26 03:03