 

mlflow How to save a sklearn pipeline with custom transformer?

I am trying to use mlflow to save a sklearn machine-learning model, a pipeline containing a custom transformer I have defined, and load it in another project. My custom transformer inherits from BaseEstimator and TransformerMixin.

Let's say I have 2 projects:

  • train_project: it has the custom transformers in src.ml.transformers.py
  • use_project: it has other things in src, or has no src directory at all
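For concreteness, here is a minimal sketch of what such a custom transformer might look like (the class name and the doubling logic are made up for illustration; it stands in for the ones in train_project's src.ml.transformers module):

```python
from sklearn.base import BaseEstimator, TransformerMixin

class DoubleTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical custom transformer, standing in for the real ones."""

    def fit(self, X, y=None):
        # Stateless transformer: nothing to learn during fit.
        return self

    def transform(self, X):
        # Toy transformation: double every value.
        return [[v * 2 for v in row] for row in X]
```

Because the pickled pipeline only stores a reference to this class by module path, the loading project must be able to import it under the same name.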

So in my train_project I do:

mlflow.sklearn.log_model(preprocess_pipe, 'model/preprocess_pipe')

and then when I try to load it in use_project:

preprocess_pipe = mlflow.sklearn.load_model(f'{ref_model_path}/preprocess_pipe')

an error occurs:

[...]
File "/home/quentin/anaconda3/envs/api_env/lib/python3.7/site-packages/mlflow/sklearn.py", line 210, in _load_model_from_local_file
    return pickle.load(f)
ModuleNotFoundError: No module named 'train_project'

I tried the serialization format mlflow.sklearn.SERIALIZATION_FORMAT_CLOUDPICKLE:

mlflow.sklearn.log_model(preprocess_pipe, 'model/preprocess_pipe', serialization_format=mlflow.sklearn.SERIALIZATION_FORMAT_CLOUDPICKLE)

but I get the same error during load.

I saw the code_path option of mlflow.pyfunc.log_model, but its use and purpose are not clear to me.

I thought mlflow provided an easy way to save models and serialize them so they can be used anywhere. Is that true only for native sklearn models (or keras, ...)?

It seems that this issue is more related to how pickle works (mlflow uses it, and pickle needs all dependencies to be importable on the loading side).
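This can be seen with plain pickle and no mlflow at all: pickle serializes a custom class by reference (module name plus class name), not by value, so the defining module must be importable wherever the object is loaded. A small stdlib-only demonstration:

```python
import pickle

class MyTransformer:
    """Stand-in for a custom transformer class."""
    def transform(self, X):
        return [v * 2 for v in X]

payload = pickle.dumps(MyTransformer())

# The pickle stream contains the class *name* and its module path,
# not the class definition itself:
assert b"MyTransformer" in payload
assert MyTransformer.__module__.encode() in payload

# Unpickling works here because the defining module is importable...
restored = pickle.loads(payload)
assert restored.transform([1, 2]) == [2, 4]
# ...but in a process where the original module (here, something like
# 'train_project.src.ml.transformers') does not exist, the same
# pickle.load raises ModuleNotFoundError.
```

This also explains why SERIALIZATION_FORMAT_CLOUDPICKLE did not help: cloudpickle serializes classes by value only when it considers them non-importable (e.g. defined in __main__); classes living in an importable project module are still pickled by reference.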

The only solution I have found so far is to make my transformer a package and import it in both projects, save the version of my transformer library with the conda_env argument of log_model, and check that it is the same version when I load the model in use_project. But that is painful if I have to change my transformer or debug it...
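That workaround can at least be made reproducible by pinning the package version in the environment logged with the model; a sketch, where the package name my-transformers is hypothetical:

```python
# Environment spec for mlflow's conda_env argument, pinning the
# (hypothetical) packaged transformer library used by the pipeline.
conda_env = {
    "name": "train_env",
    "channels": ["defaults"],
    "dependencies": [
        "python=3.7.3",
        "pip",
        {"pip": ["mlflow==1.5.0", "my-transformers==0.1.0"]},
    ],
}

# In train_project (requires an active mlflow run):
# mlflow.sklearn.log_model(preprocess_pipe, "model/preprocess_pipe",
#                          conda_env=conda_env)
```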

Does anybody have a better, more elegant solution? Maybe there is some mlflow functionality I have missed?

Other information:

  • working on linux (ubuntu)
  • mlflow=1.5.0
  • python=3.7.3

I saw in the tests of the mlflow.sklearn API that they test a custom transformer, but they load it in the same file, so it does not seem to resolve my issue. Maybe it can help other people:

https://github.com/mlflow/mlflow/blob/master/tests/sklearn/test_sklearn_model_export.py

aliene28 asked Mar 04 '20 16:03

1 Answer

What you are trying to do is serialize something "customized" that you've trained in a module outside of train.py, correct?

What you probably will need to do is log your model with mlflow.pyfunc.log_model with the code_path argument, which takes a list of strings containing the paths to the modules needed to deserialize the model and make predictions, as documented here.

What needs to be clear is that every mlflow model is a PyFunc by nature. Even when you log a model with mlflow.sklearn, you can load it with mlflow.pyfunc.load_model. And what a PyFunc does is standardize all models and frameworks in a unified way, guaranteeing you'll always declare how to:

  1. de-serialize your model, with the load_context() method
  2. make your predictions, with the predict() method

If you take care of both things in an object that inherits from mlflow's PythonModel class, you can then log your model as a PyFunc.

What mlflow.sklearn.log_model does is basically wrap up the way you declare serialization and de-serialization. If you stick with sklearn's basic modules, such as basic transformers and pipelines, you'll always be fine with it. But when you need something custom, then you refer to PyFuncs instead.

You can find a very useful example here. Notice it states exactly how to make the predictions, transforming the input into an XGBoost DMatrix.

Murilo Mendonça answered May 29 '23 12:05