Why do I need sklearn in docker container if I already have the model as a pickle?

I pickled a model and want to expose only the prediction API, written in Flask. However, when I write a Dockerfile that builds an image without sklearn in it, I get the error ModuleNotFoundError: No module named 'sklearn.xxxx', where xxxx refers to one of sklearn's ML algorithm classes, at the point where I load the model with pickle, as in classifier = pickle.load(f).

When I rewrite the Dockerfile to build an image that includes sklearn, the error goes away, even though I never import sklearn in the API.
My understanding of pickling is very simple: it serializes the classifier class with all of its data. So when we unpickle it, since the classifier class already has a predict attribute, we can just call it. Why do I need to have sklearn in the environment?

jar asked Aug 31 '25 16:08

1 Answer

You have a misconception of how pickle works.

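It does not serialize anything except the instance state (__dict__ by default, or whatever a custom implementation such as __getstate__ provides), together with a reference to the class by module and name. When unpickling, it imports that module to look up the class (this is where your import error comes from), creates an instance of it, and sets the pickled state on it.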
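A minimal sketch of this behavior, using only the standard library. The module name fake_ml_lib and the Classifier class are hypothetical stand-ins for sklearn and one of its estimators; the throwaway module is registered in sys.modules only so that it can later be "uninstalled" to reproduce the error from the question.

```python
import pickle
import sys
import types

# Hypothetical stand-in for sklearn: a throwaway in-memory module.
fake_lib = types.ModuleType("fake_ml_lib")


class Classifier:
    """Hypothetical stand-in for an sklearn estimator."""

    def __init__(self, threshold):
        self.threshold = threshold

    def predict(self, x):
        return x > self.threshold


# Make the class look as if it had been defined inside fake_ml_lib.
Classifier.__module__ = "fake_ml_lib"
Classifier.__qualname__ = "Classifier"
fake_lib.Classifier = Classifier
sys.modules["fake_ml_lib"] = fake_lib

payload = pickle.dumps(Classifier(0.5))

# The payload names the module, the class, and the instance state,
# but it carries none of the method code.
print(b"fake_ml_lib" in payload, b"Classifier" in payload)
print(b"threshold" in payload, b"predict" in payload)

# Simulate a container image where the library is not installed.
del sys.modules["fake_ml_lib"]
try:
    pickle.loads(payload)
    load_error = None
except ModuleNotFoundError as exc:
    load_error = str(exc)
print(load_error)

# "Install" the library again: the same bytes now load and predict.
sys.modules["fake_ml_lib"] = fake_lib
restored = pickle.loads(payload)
print(restored.predict(0.7))
```

The Flask code never has to import the library itself: pickle imports it on the caller's behalf while loading, which is why the image needs sklearn even though the API source does not mention it.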

There's a reason for this: you don't know beforehand which methods will be used after loading, so you cannot pickle the implementation. In addition, at pickle time you cannot build an AST to see which methods and modules will be needed after deserializing. The main reason is the dynamic nature of Python: the code that actually runs can vary depending on the input.
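One consequence can be sketched as follows: because only state is stored, the behavior of a loaded object comes from whatever class definition is importable at load time. The names fake_versioned_lib, Model, and install are hypothetical and exist only for this illustration.

```python
import pickle
import sys
import types


def install(module_name, cls):
    """Register cls as module_name.<cls name> in a throwaway module (illustration only)."""
    cls.__module__ = module_name
    cls.__qualname__ = cls.__name__
    module = types.ModuleType(module_name)
    setattr(module, cls.__name__, cls)
    sys.modules[module_name] = module


class Model:
    """Version 1 of a hypothetical library class."""

    def __init__(self, weight):
        self.weight = weight

    def predict(self, x):
        return self.weight * x


install("fake_versioned_lib", Model)
payload = pickle.dumps(Model(2))
OldModel = Model


class Model:
    """Version 2: same name, different implementation."""

    def predict(self, x):
        return self.weight * x + 1


# Pretend the library was upgraded between saving and loading.
install("fake_versioned_lib", Model)
restored = pickle.loads(payload)

print(restored.weight)                 # state came from the pickle
print(restored.predict(3))             # behavior came from version 2
print(isinstance(restored, OldModel))  # the old class is not involved
```

The pickled bytes are identical in both cases; only the code found at load time differs, so the result of predict changes with it.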

After all, even if we assume a smart, self-contained pickle serialization were theoretically possible, the result would be the actual model plus sklearn in a single file, with no proper way to manage it.

Slam answered Sep 02 '25 05:09
