I would like to pre-train a model and then continue training it with another model: I have a DecisionTreeClassifier and would like to train it further with an LGBMClassifier. Is there a way to do this in scikit-learn?
I have already read this post about it: https://datascience.stackexchange.com/questions/28512/train-new-data-to-pre-trained-model. In the post it says:
As per the official documentation, calling fit() more than once will overwrite what was learned by any previous fit().
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import lightgbm as lgb

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Train the Decision Tree classifier
clf = DecisionTreeClassifier()
clf = clf.fit(X_train, y_train)

# Train an LGBM classifier on the same data
lgbm = lgb.LGBMClassifier()
lgbm = lgbm.fit(X_train, y_train)

# Predict the response for the test dataset
y_pred = lgbm.predict(X_test)
Perhaps you are looking for stacked classifiers. In this approach, the predictions of earlier models are available as features for later models. Look into StackingClassifier. Adapted from the documentation:
from sklearn.ensemble import StackingClassifier
from sklearn.tree import DecisionTreeClassifier
from lightgbm import LGBMClassifier

estimators = [
    ('dtc_model', DecisionTreeClassifier()),
]

clf = StackingClassifier(
    estimators=estimators,
    final_estimator=LGBMClassifier()
)
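Once assembled, the stacked model is fitted and used like any other scikit-learn estimator. A minimal usage sketch, assuming the X_train, y_train, and X_test from the question:

# Fitting trains the base estimators, then trains the final LGBM estimator
# on cross-validated predictions of the base estimators
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)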
Unfortunately this is not possible at present. According to the documentation at https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMClassifier.html?highlight=init_model, the init_model argument lets you continue training only if the initial model is itself a LightGBM model.
I did try this setup:

import pickle
from sklearn.tree import DecisionTreeClassifier
from lightgbm import LGBMClassifier

# dtc
dtc_model = DecisionTreeClassifier()
dtc_model = dtc_model.fit(X_train, y_train)

# save the fitted tree to disk
dtc_fn = 'dtc.pickle.db'
pickle.dump(dtc_model, open(dtc_fn, 'wb'))

# lgbm, attempting to continue training from the pickled tree
lgbm_model = LGBMClassifier()
lgbm_model.fit(X_train_2, y_train_2, init_model=dtc_fn)

And I get:

LightGBMError: Unknown model format or submodel type in model file dtc.pickle.db
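For comparison, continued training does work when the initial model is itself a LightGBM model. A minimal sketch, reusing the same placeholder data splits:

from lightgbm import LGBMClassifier

# Train a first LightGBM model on the first batch of data
first = LGBMClassifier()
first.fit(X_train, y_train)

# Continue training on a second batch, starting from the first model's booster
second = LGBMClassifier()
second.fit(X_train_2, y_train_2, init_model=first.booster_)

init_model accepts a Booster or LGBMModel instance, or the filename of a model saved with booster_.save_model(); a pickled scikit-learn estimator will not work, as the error above shows.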
As @Ferdy explained in his post, there is no simple way to perform this operation, and understandably so.
Scikit-learn's DecisionTreeClassifier takes only numerical features and cannot handle NaN values, whereas LGBMClassifier can handle those.
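A minimal sketch of that difference, on hypothetical toy data:

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from lightgbm import LGBMClassifier

X = np.array([[1.0], [2.0], [np.nan], [4.0]])
y = np.array([0, 0, 1, 1])

# LightGBM learns a default direction for missing values at each split
LGBMClassifier(min_child_samples=1).fit(X, y)

# scikit-learn trees (before version 1.3) reject NaN input outright
try:
    DecisionTreeClassifier().fit(X, y)
except ValueError as e:
    print(e)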
By looking at the decision function of scikit-learn trees you can see that all they can perform is splits of the form feature <= threshold.

On the contrary, LGBM can perform the following kinds of splits:

feature is na
feature <= threshold
feature in categories
Splits in a decision tree are selected at each step as the ones that best split the set of items: they try to minimize the node impurity (Gini) or entropy.
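For reference, a small sketch of how Gini impurity is computed (one minus the sum of squared class proportions):

import numpy as np

def gini(labels):
    # Gini impurity: 1 - sum_i p_i^2 over the class proportions p_i
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini([0, 0, 1, 1]))  # 0.5: a perfectly mixed binary node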
The risk of further training a DecisionTreeClassifier is that you cannot be sure the splits performed in the original tree are still the best, since LGBM's additional split capabilities might, and should, lead to better performance.
I would recommend retraining the model with LGBMClassifier alone, as its splits may well differ from those of the original scikit-learn tree.