I built a scikit-learn model and I want to reuse it in a daily Python cron job (NB: no other platforms are involved: no R, no Java, &c).
I pickled it (actually, I pickled my own object, one field of which is a GradientBoostingClassifier), and I un-pickle it in the cron job. So far so good (this has been discussed in Save classifier to disk in scikit-learn and Model persistence in Scikit-Learn?).
However, I upgraded sklearn, and now I get these warnings:
.../.local/lib/python2.7/site-packages/sklearn/base.py:315:
UserWarning: Trying to unpickle estimator DecisionTreeRegressor from version 0.18.1 when using version 0.18.2. This might lead to breaking code or invalid results. Use at your own risk.
UserWarning)
.../.local/lib/python2.7/site-packages/sklearn/base.py:315:
UserWarning: Trying to unpickle estimator PriorProbabilityEstimator from version 0.18.1 when using version 0.18.2. This might lead to breaking code or invalid results. Use at your own risk.
UserWarning)
.../.local/lib/python2.7/site-packages/sklearn/base.py:315:
UserWarning: Trying to unpickle estimator GradientBoostingClassifier from version 0.18.1 when using version 0.18.2. This might lead to breaking code or invalid results. Use at your own risk.
UserWarning)
What do I do now?
I can downgrade to 0.18.1 and stick with it until I am ready to rebuild the model. For various reasons I find this unacceptable.
I can un-pickle the file and re-pickle it again. This worked with 0.18.2, but breaks with 0.19. NFG. joblib looks no better.
I wish I could save the data in a version-independent ASCII format (e.g., JSON or XML). This is, obviously, the optimal solution, but there seems to be NO way to do that (see also Sklearn - model persistence without pkl file).
I could save the model to PMML, but its support is lukewarm at best: I can use sklearn2pmml to save the model (although not easily), and augustus/lightpmmlpredictor to apply (although not load) the model. However, none of those is available to pip directly, which makes deployment a nightmare. Also, the augustus & lightpmmlpredictor projects seem to be dead. Importing PMML models into Python (Scikit-learn) - nope.
A variant of the above: save PMML using sklearn2pmml, and use openscoring for scoring. Requires interfacing with an external process. Yuk.
Suggestions?
Model persistence across different versions of scikit-learn is generally impossible. The reason is obvious: you pickle Class1 with one definition, and want to unpickle it into Class2 with another definition.
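Incidentally, this is exactly what the warnings in the question detect: sklearn tags every pickled estimator with the version of the library that wrote it, and compares that tag against the running version when unpickling. A minimal illustration (this pokes at internals, which may of course change across releases):
import sklearn
from sklearn.tree import DecisionTreeRegressor

state = DecisionTreeRegressor().__getstate__()  # the dict that pickle would store
print(state['_sklearn_version'])  # version baked into the pickle
print(sklearn.__version__)        # version doing the unpickling; a mismatch triggers the UserWarning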
You can:
- Still try to unpickle with the new version of the library, and hope that what worked for Class1 will work also for Class2.
- Serialize the data that actually define the fitted model (e.g. all the trees of a GradientBoostingClassifier) and restore it from this serialized form, and hope that it would work better than pickle.
I made an example of how you can convert a single DecisionTreeRegressor into a pure list-and-dict format, fully JSON-compatible, and restore it back.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import make_classification
### Code to serialize and deserialize trees
# per-node arrays stored on the low-level tree_ object
# (despite the name, these describe all nodes, not only leaves)
LEAF_ATTRIBUTES = ['children_left', 'children_right', 'threshold', 'value', 'feature', 'impurity', 'weighted_n_node_samples']
# scalar attributes stored on the estimator itself
TREE_ATTRIBUTES = ['n_classes_', 'n_features_', 'n_outputs_']
def serialize_tree(tree):
    """ Convert a sklearn.tree.DecisionTreeRegressor into a json-compatible format """
    encoded = {
        'nodes': {},
        'tree': {},
        'n_leaves': len(tree.tree_.threshold),  # actually the total node count, leaves included
        'params': tree.get_params()
    }
    for attr in LEAF_ATTRIBUTES:
        encoded['nodes'][attr] = getattr(tree.tree_, attr).tolist()  # numpy array -> plain list
    for attr in TREE_ATTRIBUTES:
        encoded['tree'][attr] = getattr(tree, attr)
    return encoded
def deserialize_tree(encoded):
    """ Restore a sklearn.tree.DecisionTreeRegressor from a json-compatible format """
    x = np.arange(encoded['n_leaves'])
    # fit a dummy tree on distinct points so that the underlying tree_ allocates
    # at least encoded['n_leaves'] nodes, which we then overwrite in place
    tree = DecisionTreeRegressor().fit(x.reshape((-1, 1)), x)
    tree.set_params(**encoded['params'])
    for attr in LEAF_ATTRIBUTES:
        for i in range(encoded['n_leaves']):
            getattr(tree.tree_, attr)[i] = encoded['nodes'][attr][i]
    for attr in TREE_ATTRIBUTES:
        setattr(tree, attr, encoded['tree'][attr])
    return tree
## test the code
X, y = make_classification(n_classes=3, n_informative=10)
tree = DecisionTreeRegressor().fit(X, y)
encoded = serialize_tree(tree)
decoded = deserialize_tree(encoded)
assert (decoded.predict(X)==tree.predict(X)).all()
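Since the encoded dict contains nothing but plain lists, dicts, and scalars, it survives a real JSON round trip; a quick sanity check, reusing tree, encoded, and X from the test above:
import json

payload = json.dumps(encoded)                     # version-independent ASCII
restored = deserialize_tree(json.loads(payload))  # rebuild from plain JSON
assert (restored.predict(X) == tree.predict(X)).all()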
Having this, you can go on to serialize and deserialize the whole GradientBoostingClassifier:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble.gradient_boosting import PriorProbabilityEstimator
def serialize_gbc(clf):
encoded = {
'classes_': clf.classes_.tolist(),
'max_features_': clf.max_features_,
'n_classes_': clf.n_classes_,
'n_features_': clf.n_features_,
'train_score_': clf.train_score_.tolist(),
'params': clf.get_params(),
'estimators_shape': list(clf.estimators_.shape),
'estimators': [],
        'priors': clf.init_.priors.tolist()  # state of the PriorProbabilityEstimator used as init_
}
for tree in clf.estimators_.reshape((-1,)):
encoded['estimators'].append(serialize_tree(tree))
return encoded
def deserialize_gbc(encoded):
    # fit on a tiny dummy problem so that all the fitted attributes exist,
    # then overwrite them with the stored values below
    x = np.array(encoded['classes_'])
    clf = GradientBoostingClassifier(**encoded['params']).fit(x.reshape(-1, 1), x)
trees = [deserialize_tree(tree) for tree in encoded['estimators']]
clf.estimators_ = np.array(trees).reshape(encoded['estimators_shape'])
clf.init_ = PriorProbabilityEstimator()
clf.init_.priors = np.array(encoded['priors'])
clf.classes_ = np.array(encoded['classes_'])
clf.train_score_ = np.array(encoded['train_score_'])
clf.max_features_ = encoded['max_features_']
clf.n_classes_ = encoded['n_classes_']
clf.n_features_ = encoded['n_features_']
return clf
# test on the same problem
clf = GradientBoostingClassifier()
clf.fit(X, y)
encoded = serialize_gbc(clf)
decoded = deserialize_gbc(encoded)
assert (decoded.predict(X) == clf.predict(X)).all()
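In the cron-job setting from the question, this means you can dump the encoded dict to a JSON file once after training, and have the job rebuild the classifier under whatever sklearn version is installed. A minimal sketch (the file name is arbitrary):
import json

# one-off, after training: persist the model in a version-independent form
with open('gbc_model.json', 'w') as f:
    json.dump(serialize_gbc(clf), f)

# in the daily cron job: rebuild the model and predict as usual
with open('gbc_model.json') as f:
    clf = deserialize_gbc(json.load(f))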
This works for scikit-learn v0.19, but don't ask me what will come in the next versions to break this code. I'm neither a prophet nor a developer of sklearn.
If you want to be fully independent of new versions of sklearn, the safest thing is to write a function that traverses a serialized tree and makes the prediction, instead of re-creating an sklearn tree.
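For a single tree in the list-and-dict format produced by serialize_tree above, such a predictor takes only a few lines. A sketch, relying on sklearn's convention that a child index of -1 marks a leaf:
def predict_serialized_tree(encoded, sample):
    """ Predict one sample by walking the serialized tree; no sklearn involved """
    nodes = encoded['nodes']
    i = 0
    while nodes['children_left'][i] != -1:  # -1 marks a leaf node
        if sample[nodes['feature'][i]] <= nodes['threshold'][i]:
            i = nodes['children_left'][i]
        else:
            i = nodes['children_right'][i]
    # 'value' has shape (n_nodes, n_outputs, n_classes); a regressor keeps
    # its prediction at [i][0][0]
    return nodes['value'][i][0][0]

# agrees with the sklearn tree from the first test
enc_tree = serialize_tree(tree)
assert all(predict_serialized_tree(enc_tree, row) == pred
           for row, pred in zip(X.tolist(), tree.predict(X)))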