I want to save an sklearn Pipeline, made of a custom Preprocessor and a RandomForestClassifier, to disk with all of its dependencies bundled inside the saved file. Without this, I have to copy all the dependencies (my custom modules) into the same folder everywhere I want to call this model (in my case, on a remote server).
The preprocessor is defined in a class that lives in another file (preprocessing.py) in my project folder, so I get access to it through an import.
training.py
from preprocessing import Preprocessor
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
import pickle
clf = Pipeline([
    ("preprocessing", Preprocessor()),
    ("model", RandomForestClassifier())
])
# some fitting of the classifier
# ...
# Export
with open(savepath, "wb") as handle:
    pickle.dump(clf, handle, protocol=pickle.HIGHEST_PROTOCOL)
I tried pickle (and some of its variations), dill and joblib, but none of them did the trick: when I load the .pkl somewhere else (say, on my remote server), I still need an identical preprocessing.py in place... which is a pain.
What I would love is to have another file somewhere else:
remote.py
import pickle
with open(savepath, "rb") as handle:
    model = pickle.load(handle)
print(model.predict(some_matrix))
But this code currently gives me an error because it does not find the Preprocessor class: pickle only stores a reference to preprocessing.Preprocessor, not the class's code, so the module has to be importable wherever the file is loaded...
In case your model contains large arrays of data, joblib may store each array in a separate companion file, but the save and restore procedure remains the same. If you only need to keep simple parameters rather than the fitted object, you can convert a Python dictionary to a JSON string with json.dumps (passing indent=4 for readable output) and save that string to a file.
To save the model itself, all you need to do is pass the model object into pickle's dump() function. This serializes the object into a byte stream that you can save as a file called model.pkl.
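For illustration, here is a rough sketch of those two routes, reusing the clf pipeline from the question (the file names model.joblib and params.json, and the example hyperparameter dict, are placeholders I am assuming):

import json
import joblib

# joblib: the dump/load calls are the same whether everything lands in one
# file or, depending on version and compression settings, large arrays end
# up in companion files next to it
joblib.dump(clf, "model.joblib")
clf_restored = joblib.load("model.joblib")

# JSON: only works for plain Python types such as a dict of hyperparameters,
# not for the fitted pipeline object itself
params = {"n_estimators": 100, "max_depth": None}
with open("params.json", "w") as f:
    f.write(json.dumps(params, indent=4))

Note that neither of these bundles the Preprocessor code into the saved file, so they do not remove the dependency on preprocessing.py by themselves.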
I'm facing an identical issue right now. To address it, I am going to turn my pipeline/model, along with all its dependencies (the preprocessing classes), into a Python package using setuptools, so that it is self-contained and can be run anywhere (remote server, Docker container, VM).
I'm currently going through this process, and if this is something you are interested in, I can respond with the additional steps spelled out as I make progress.
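For reference, a minimal sketch of what that packaging could look like; the package name my_model and the layout below are placeholders I am assuming, not a fixed recipe. The Preprocessor class moves into my_model/preprocessing.py inside an installable package:

setup.py
from setuptools import setup, find_packages

setup(
    name="my_model",                    # placeholder package name
    version="0.1.0",
    packages=find_packages(),           # picks up my_model/ with its __init__.py
    install_requires=["scikit-learn"],  # runtime dependencies of the pipeline
)

The training script then imports the class with from my_model.preprocessing import Preprocessor before fitting and pickling. After pip install . (or installing a built wheel) on both the training machine and the remote server, the pickle resolves my_model.preprocessing.Preprocessor at load time, so there is no more manual copying of preprocessing.py.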