Pre-train a model (classifier) in scikit-learn

I would like to pre-train a model and then continue training it with another model.

I have a DecisionTreeClassifier and I would like to train it further with an LGBMClassifier. Is it possible to do this in scikit-learn? I have already read this post about it: https://datascience.stackexchange.com/questions/28512/train-new-data-to-pre-trained-model. The post says:

As per the official documentation, calling fit() more than once will overwrite what was learned by any previous fit()

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import lightgbm as lgb

# X, y: my feature matrix and labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Train a Decision Tree classifier
clf = DecisionTreeClassifier()
clf = clf.fit(X_train, y_train)

# Train an LGBM classifier on the same data
lgbm = lgb.LGBMClassifier()
lgbm = lgbm.fit(X_train, y_train)

# Predict the response for the test dataset
y_pred = lgbm.predict(X_test)
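
To show what I mean, here is a minimal sketch (toy data from make_classification; the names are mine) of the overwrite behaviour the quote describes:

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=100, random_state=1)

clf = DecisionTreeClassifier(random_state=1)
clf.fit(X[:50], y[:50])   # first fit
clf.fit(X[50:], y[50:])   # second fit: the first model is discarded, not extended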
Asked by Test, Nov 28 '21


3 Answers

Perhaps you are looking for stacked classifiers.

In this approach, the predictions of earlier models are available as features for later models.

Look into StackingClassifier.

Adapted from the documentation:

from sklearn.ensemble import StackingClassifier
from sklearn.tree import DecisionTreeClassifier
from lightgbm import LGBMClassifier

# The tree's predictions become input features for the final LGBM estimator
estimators = [
    ('dtc_model', DecisionTreeClassifier()),
]

clf = StackingClassifier(
    estimators=estimators,
    final_estimator=LGBMClassifier(),
)
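
A possible usage sketch, reusing X_train/X_test from the question:

clf.fit(X_train, y_train)     # fits the base tree, then the LGBM final estimator
y_pred = clf.predict(X_test)

By default StackingClassifier trains the final estimator on out-of-fold predictions of the base estimators (5-fold cross-validation), so the LGBM model is not fit on predictions the tree made on its own training data.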
Answered by MYK, Oct 29 '22

Unfortunately this is not possible at present. According to the documentation at https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMClassifier.html?highlight=init_model, the init_model argument lets you continue training only if the initial model is itself a LightGBM model.

I did try this setup:

import pickle

from sklearn.tree import DecisionTreeClassifier
from lightgbm import LGBMClassifier

# Train a decision tree
dtc_model = DecisionTreeClassifier()
dtc_model = dtc_model.fit(X_train, y_train)

# Save it to disk
dtc_fn = 'dtc.pickle.db'
pickle.dump(dtc_model, open(dtc_fn, 'wb'))

# Try to continue training from the pickled tree
lgbm_model = LGBMClassifier()
lgbm_model.fit(X_train_2, y_train_2, init_model=dtc_fn)

And I get:

LightGBMError: Unknown model format or submodel type in model file dtc.pickle.db
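
Continued training does work when the initial model is itself from LightGBM. A minimal sketch, assuming a second data batch X_train_2/y_train_2 as above:

from lightgbm import LGBMClassifier

# First round of training
first = LGBMClassifier()
first.fit(X_train, y_train)

# Continue boosting from the fitted model on the new batch
second = LGBMClassifier()
second.fit(X_train_2, y_train_2, init_model=first.booster_)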
Answered by ferdy, Oct 29 '22

As @ferdy explained in his post, there is no simple way to perform this operation, and understandably so.

Scikit-learn's DecisionTreeClassifier takes only numerical features and cannot handle NaN values, whereas LGBMClassifier handles both categorical features and NaN values.

Looking at scikit-learn's decision function, all it can perform are splits of the form feature <= threshold.

On the contrary, LGBM can perform the following splits (see the sketch after this list):

  • feature is NaN
  • feature <= threshold
  • feature in categories
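
A minimal sketch of that flexibility (toy data of my own making): LGBMClassifier accepts NaN values and pandas categorical columns directly, with no imputation or encoding step.

import numpy as np
import pandas as pd
from lightgbm import LGBMClassifier

df = pd.DataFrame({
    'num': [1.0, 2.0, np.nan, 4.0, 5.0, np.nan],
    'cat': pd.Categorical(['a', 'b', 'a', 'c', 'b', 'c']),
})
y = [0, 0, 1, 1, 0, 1]

# min_child_samples=1 only so that splits are possible on six rows
clf = LGBMClassifier(min_child_samples=1)
clf.fit(df, y)   # NaN gets a default branch; 'cat' is split as a categorical feature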

Splits in a decision tree are selected at each step because they best split the set of items, minimizing node impurity (Gini) or entropy.

The risk of further training a DecisionTreeClassifier is that you cannot be sure the splits performed in the original tree are still the best, since LGBM's additional split capabilities might, and likely would, lead to better performance.

I would recommend retraining the model with LGBMClassifier alone, since its splits will likely differ from those of the original scikit-learn tree.

Answered by Antoine Dubuis, Oct 29 '22