I pickled a model and want to expose only the prediction API, written in Flask. However, when I write a Dockerfile to build an image without sklearn in it, I get ModuleNotFoundError: No module named 'sklearn.xxxx' (where xxxx refers to one of sklearn's ML algorithm classes) at the point where I load the model with classifier = pickle.load(f).
When I rewrite the Dockerfile to build an image that does include sklearn, I no longer get the error, even though the API never imports sklearn.

My mental model of pickling was simple: it serializes the classifier object with all of its data, so when we unpickle it, the classifier already has a predict method and we can just call it. Why do I need sklearn in the environment?
You have a misconception of how pickle works.

It does not serialize any code. It serializes only instance state (__dict__ by default, or whatever a custom __getstate__/__reduce__ implementation returns) plus a reference to the class by module path and name. When unpickling, it imports the corresponding class (this is where your import error comes from), creates an instance, and restores the pickled state.
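You can see this directly by inspecting the raw pickle bytes. The sketch below uses a toy Classifier class as a stand-in for an sklearn estimator (the class and its attributes are hypothetical, for illustration only):

```python
import pickle

class Classifier:
    """Stand-in for an sklearn estimator (hypothetical, for illustration)."""
    def __init__(self):
        self.coef = [0.5, -1.2]    # instance state: this IS pickled

    def predict(self, X):          # method code: this is NOT pickled
        return [sum(c * x for c, x in zip(self.coef, row)) for row in X]

data = pickle.dumps(Classifier())

print(b"coef" in data)        # True  -- the state is stored
print(b"Classifier" in data)  # True  -- the class is stored by *name*
print(b"predict" in data)     # False -- method bodies are absent
```

Since only the class name and module path are in the stream, unpickling has to import that module again to get the implementation back.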
There is a reason for this design: you don't know in advance which methods will be called after loading, so you cannot pickle only the implementation you need. Nor can pickle build some AST at serialization time to determine which methods or modules will be needed after deserialization; the main obstacle is Python's dynamic nature, where the code that actually runs can depend on the input.
After all, even if we hypothetically had a smart, self-contained pickle serialization, it would amount to bundling the model plus all of sklearn into a single file, with no proper way to manage or update either one.
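You can reproduce your Docker error without Docker by making the class's module disappear between dump and load. Here "fakesklearn" is a hypothetical throwaway module standing in for sklearn:

```python
import pickle
import sys
import types

# Create a throwaway module to stand in for sklearn (hypothetical name)
fake = types.ModuleType("fakesklearn")

class Classifier:
    def predict(self, X):
        return [0 for _ in X]

Classifier.__module__ = "fakesklearn"  # pretend the class lives there
fake.Classifier = Classifier
sys.modules["fakesklearn"] = fake

data = pickle.dumps(Classifier())  # works: the module is importable now

# Simulate an image where the library is not installed
del sys.modules["fakesklearn"]

restored = None
error = None
try:
    restored = pickle.loads(data)
except ModuleNotFoundError as e:
    error = e

print(error)  # No module named 'fakesklearn'
```

This is exactly what happens in your sklearn-less image: pickle.load tries to import sklearn's module to rebuild the classifier and fails, so the library has to be installed even though your API code never imports it.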