 

save binarizer together with sklearn model

I'm trying to build a service that has 2 components. In component 1, I train a machine learning model using sklearn by creating a Pipeline. This model gets serialized using joblib.dump (really numpy_pickle.dump). Component 2 runs in the cloud, loads the model trained by (1), and uses it to label text that it gets as input.

I'm running into an issue where, during training (component 1), I need to first binarize my data since it is text, which means that the model is trained on binarized input and then makes predictions using the mapping created by the binarizer. I need to get this mapping back when component 2 makes predictions based on the model so that I can output the actual text labels.

I tried adding the binarizer to the pipeline like this, thinking that the model would then have the mapping itself:

p = Pipeline([
    ('binarizer', MultiLabelBinarizer()),
    ('vect', CountVectorizer(min_df=min_df, ngram_range=ngram_range)),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(clf))
])

But I get the following error:

model = p.fit(training_features, training_tags)
*** TypeError: fit_transform() takes 2 positional arguments but 3 were given

My goal is to make sure the binarizer and model are tied together so that the consumer knows how to decode the model's output.

What are some existing paradigms for doing this? Should I be serializing the binarizer together with the model in some other object that I create? Is there some other way of passing the binarizer to Pipeline so that I don't have to do that, and would I be able to get the mappings back from the model if I did that?

asked Feb 02 '17 by LateCoder

People also ask

What is multi label Binarizer?

MultiLabelBinarizer allows you to encode multiple labels per instance. To translate the resulting array, you can build a DataFrame from that array and the encoded classes (available through the binarizer's classes_ attribute):

binarizer = MultiLabelBinarizer()
pd.DataFrame(binarizer.fit_transform(y), columns=binarizer.classes_)
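For instance, a self-contained sketch of that round trip, with invented sample labels:

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Each sample can carry several labels at once.
y = [['news', 'politics'], ['sports'], ['news']]

binarizer = MultiLabelBinarizer()
encoded = binarizer.fit_transform(y)

# One indicator column per class, named after binarizer.classes_.
print(pd.DataFrame(encoded, columns=binarizer.classes_))
#    news  politics  sports
# 0     1         1       0
# 1     0         0       1
# 2     1         0       0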

What is threshold Binarizer?

With the default threshold of 0, only positive values map to 1. Binarization is a common operation on text count data where the analyst can decide to consider only the presence or absence of a feature rather than, say, a quantified number of occurrences.
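As a quick illustration with sklearn.preprocessing.Binarizer (the input matrix here is invented):

from sklearn.preprocessing import Binarizer

X = [[0.0, 1.0, 3.0],
     [2.0, 0.0, 0.5]]

# Values strictly greater than the threshold become 1; the rest become 0.
print(Binarizer(threshold=0.9).fit_transform(X))
# [[0. 1. 1.]
#  [1. 0. 0.]]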

Why do we use label Binarizer?

LabelBinarizer is a scikit-learn class that accepts categorical data as input and returns a NumPy array. Unlike LabelEncoder, it encodes the data into dummy variables indicating the presence or absence of each particular label, which makes it useful for encoding a column of categorical data.
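A minimal sketch (the category values are invented):

from sklearn.preprocessing import LabelBinarizer

colors = ['red', 'green', 'blue', 'green']

lb = LabelBinarizer()
print(lb.fit_transform(colors))
# [[0 0 1]
#  [0 1 0]
#  [1 0 0]
#  [0 1 0]]
print(lb.classes_)  # ['blue' 'green' 'red']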


1 Answer

Your intuition that you should add the MultiLabelBinarizer to the pipeline was the right way to solve this problem. It would have worked, except that MultiLabelBinarizer.fit_transform does not have the fit_transform(self, X, y=None) signature that is now standard for sklearn estimators. Instead, it has a unique fit_transform(self, y) signature, which I had never noticed before. As a result of this difference, when you call fit on the pipeline, it tries to pass training_tags as a third positional argument to a function that accepts only two, which doesn't work.
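To make the mismatch concrete, here is a small reproduction (the labels are invented):

from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()

# MultiLabelBinarizer's non-standard signature takes the labels only.
mlb.fit_transform([['a'], ['a', 'b']])  # works

# Pipeline.fit effectively calls fit_transform(X, y) on each step,
# which is one positional argument too many here:
# mlb.fit_transform(X, y)
# TypeError: fit_transform() takes 2 positional arguments but 3 were given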

The solution to this problem is tricky. The cleanest way I can think of to work around it is to create your own MultiLabelBinarizer subclass that overrides fit_transform and ignores the extra y argument. Try something like the following.

from sklearn.preprocessing import MultiLabelBinarizer

class MyMLB(MultiLabelBinarizer):
    def fit_transform(self, X, y=None):
        # Accept the standard (X, y) signature but ignore y.
        return super(MyMLB, self).fit_transform(X)

Try adding this to your pipeline in place of the MultiLabelBinarizer and see what happens. If you're able to fit() the pipeline, the last problem that you'll have is that your new MyMLB class has to be importable on any system that will de-pickle your now trained, pickled pipeline object. The easiest way to do this is to put MyMLB into its own module and place a copy on the remote machine that will be de-pickling and executing the model. That should fix it.
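For example, one hypothetical layout (the module name my_mlb.py is my invention):

# my_mlb.py -- copy this same file to both the training and serving machines.
from sklearn.preprocessing import MultiLabelBinarizer

class MyMLB(MultiLabelBinarizer):
    def fit_transform(self, X, y=None):
        return super(MyMLB, self).fit_transform(X)

Both components then do from my_mlb import MyMLB; pickle records the class by its module path (my_mlb.MyMLB), so that module must be importable wherever the pipeline is unpickled.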

Update: I misunderstood how the MultiLabelBinarizer worked. It is a transformer of outputs, not of inputs. Not only does this explain the alternative fit_transform() method signature for that class, it also makes it fundamentally incompatible with inclusion in a single classification pipeline, which is limited to transforming inputs and making predictions of outputs. However, all is not lost!

Based on your question, you're already comfortable with serializing your model to disk as [some form of] a .pkl file. You should be able to also serialize a trained MultiLabelBinarizer, then unpack it and use it to decode the outputs of your pipeline. I know you're using joblib, but I'll write up this sample code as if you're using pickle; the idea still applies.

import pickle

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer

X = <training_data>
y = <training_labels>

# Binarize the multi-label targets outside the pipeline.
mlb = MultiLabelBinarizer()
multilabel_y = mlb.fit_transform(y)

p = Pipeline([
    ('vect', CountVectorizer(min_df=min_df, ngram_range=ngram_range)),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(clf))
])

# Use the binarized labels to fit the pipeline.
p.fit(X, multilabel_y)

# Serialize both the pipeline and binarizer to disk.
with open('my_sklearn_objects.pkl', 'wb') as f:
    pickle.dump((mlb, p), f)
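Since you're already on joblib, the same pairing works there as well; a minimal sketch (same invented file name as above):

import joblib  # sklearn.externals.joblib on older sklearn versions

# joblib pickles arbitrary Python objects, so the (binarizer, pipeline)
# tuple round-trips just like it does with pickle.
joblib.dump((mlb, p), 'my_sklearn_objects.pkl')
mlb, p = joblib.load('my_sklearn_objects.pkl')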

Then, after shipping the .pkl file to the remote box...

import pickle

# Hydrate the serialized objects.
with open('my_sklearn_objects.pkl', 'rb') as f:
    mlb, p = pickle.load(f)

X = <input data>  # Get your input data from somewhere.

# Predict the binarized classes using the pipeline.
mlb_predictions = p.predict(X)

# Turn those classes back into text labels using the binarizer.
classes = mlb.inverse_transform(mlb_predictions)

# Do something with the predicted labels.
<...>

Is this the paradigm for doing this? As far as I know, yes. Not only that, but if you want to keep them together (which is a good idea, I think), you can serialize them as a tuple as in the example above, so they stay in a single file. No need to serialize a custom object or anything like that.

Model serialization via pickle et al. is the sklearn-approved way to save estimators between runs and move them between computers. I've used this process successfully many times before, including in production systems.

answered Sep 28 '22 by rileymcdowell