
Scikit-learn: forget previous training data

In scikit-learn I have a model (in my case a linear model):

clf = linear_model.LinearRegression()

I can train this model with some data

clf.fit(x1,y1)

But if I call fit again, it will continue training the model:

clf.fit(x2,y2)

Now clf is a model trained with both (x1, y1) and (x2, y2).

If I want to start training from scratch, I can recreate the model by redefining clf:

clf = linear_model.LinearRegression()
clf.fit(x1,y1)
# save the model
# ...
clf = linear_model.LinearRegression()
clf.fit(x2,y2)

However, I don't want to define clf again.

Basically, the type of regressor is chosen beforehand, something like:

if params.linear_algorithm == 'least_squares':
    clf = linear_model.LinearRegression()
elif params.linear_algorithm == 'ridge':
    clf = linear_model.Ridge()
elif params.linear_algorithm == 'lasso':
    clf = linear_model.Lasso()

So, inside my train function, I don't want to redefine clf with the whole conditional block; I just want to take clf, wipe its previous training, and reuse it to train on another set of data.

Does clf have a method to forget what it has learned so far, so that when I call clf.fit(x2, y2) it is trained only on that data?

EDIT: You guys are right, the training is overwritten every time.

My problem is that I'm saving the model in a dictionary, which only stores a reference to clf, so each time clf is retrained, all the previously saved entries change too.

Redefining clf every time creates a new object, so each save then points to a different model.

Example

for i in range(3):
    # get the x and y
    # ...
    clf.fit(x, y)
    model[i] = clf

Any idea how to save a different model each time, instead of having every model[i] point to the same clf?

Sembei Norimaki asked Jul 31 '18 12:07



2 Answers

Your assumption is wrong. According to the Scikit-Learn docs:

Calling fit() more than once will overwrite what was learned by any previous fit().

You can therefore use your code safely and it will achieve what you need.
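A minimal sketch (with made-up data) that confirms this behavior: fitting on data with slope 2, then refitting on data with slope 3, leaves only the second fit's coefficients.

```python
import numpy as np
from sklearn import linear_model

clf = linear_model.LinearRegression()

# first dataset: y = 2 * x
x1 = np.array([[0.0], [1.0], [2.0]])
y1 = np.array([0.0, 2.0, 4.0])
clf.fit(x1, y1)
slope_after_first_fit = clf.coef_[0]

# second dataset: y = 3 * x
x2 = np.array([[0.0], [1.0], [2.0]])
y2 = np.array([0.0, 3.0, 6.0])
clf.fit(x2, y2)
slope_after_second_fit = clf.coef_[0]

print(slope_after_first_fit)   # ~2.0
print(slope_after_second_fit)  # ~3.0 -- the first fit is gone
```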

Michele Tonutti answered Oct 02 '22 23:10


I am pretty sure it overwrites any existing information from before; the Scikit-Learn docs specify that. Unless you use warm_start=True, fit() calls will overwrite whatever was learned previously.
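As for the edited question (every model[i] pointing at the same clf object), one option, sketched here rather than taken from the answer, is sklearn.base.clone, which returns a fresh unfitted estimator with the same hyperparameters as the one chosen in the conditional block; copy.deepcopy works too if you want to snapshot an already-fitted estimator. The data below is made up for illustration.

```python
import copy
import numpy as np
from sklearn import linear_model
from sklearn.base import clone

base = linear_model.LinearRegression()  # chosen once, as in the question
model = {}

for i in range(3):
    # made-up data with a different slope each iteration: y = (i + 1) * x
    x = np.array([[0.0], [1.0], [2.0]])
    y = np.array([0.0, 1.0, 2.0]) * (i + 1)

    clf = clone(base)   # fresh, unfitted copy with the same hyperparameters
    clf.fit(x, y)
    model[i] = clf      # each entry is now a distinct object

    # alternatively, snapshot the fitted estimator instead:
    # model[i] = copy.deepcopy(clf)

print([round(m.coef_[0], 3) for m in model.values()])  # [1.0, 2.0, 3.0]
```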

Nikolas Pitsillos answered Oct 02 '22 23:10