 

save binarizer together with sklearn model

I'm trying to build a service that has 2 components. In component 1, I train a machine learning model using sklearn by creating a Pipeline. This model gets serialized using joblib.dump (really numpy_pickle.dump). Component 2 runs in the cloud, loads the model trained by (1), and uses it to label text that it gets as input.

I'm running into an issue where, during training (component 1), I need to first binarize my data since it is text, which means that the model is trained on binarized input and then makes predictions using the mapping created by the binarizer. I need to get this mapping back when component 2 makes predictions based on the model so that I can output the actual text labels.

I tried adding the binarizer to the pipeline like this, thinking that the model would then have the mapping itself:

p = Pipeline([
    ('binarizer', MultiLabelBinarizer()),
    ('vect', CountVectorizer(min_df=min_df, ngram_range=ngram_range)),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(clf))
])

But I get the following error:

model = p.fit(training_features, training_tags)
*** TypeError: fit_transform() takes 2 positional arguments but 3 were given

My goal is to make sure the binarizer and model are tied together so that the consumer knows how to decode the model's output.

What are some existing paradigms for doing this? Should I be serializing the binarizer together with the model in some other object that I create? Is there some other way of passing the binarizer to Pipeline so that I don't have to do that, and would I be able to get the mappings back from the model if I did that?

asked Feb 02 '17 by LateCoder

People also ask

What is multi label Binarizer?

MultiLabelBinarizer allows you to encode multiple labels per instance. To translate the resulting array, you can build a DataFrame from that array and the encoded classes (available through the binarizer's classes_ attribute):

binarizer = MultiLabelBinarizer()
pd.DataFrame(binarizer.fit_transform(y), columns=binarizer.classes_)
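For instance, a self-contained sketch of that round trip, with invented sample labels:

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Each sample can carry several labels at once.
y = [['news', 'politics'], ['sports'], ['news']]

binarizer = MultiLabelBinarizer()
encoded = binarizer.fit_transform(y)

# One indicator column per class, named after binarizer.classes_.
print(pd.DataFrame(encoded, columns=binarizer.classes_))
#    news  politics  sports
# 0     1         1       0
# 1     0         0       1
# 2     1         0       0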

What is threshold Binarizer?

With the default threshold of 0, only positive values map to 1. Binarization is a common operation on text count data where the analyst can decide to consider only the presence or absence of a feature rather than, say, a quantified number of occurrences.
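As a quick illustration with sklearn.preprocessing.Binarizer (the input matrix here is invented):

from sklearn.preprocessing import Binarizer

X = [[0.0, 1.0, 3.0],
     [2.0, 0.0, 0.5]]

# Values strictly greater than the threshold become 1; the rest become 0.
print(Binarizer(threshold=0.9).fit_transform(X))
# [[0. 1. 1.]
#  [1. 0. 0.]]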

Why do we use label Binarizer?

LabelBinarizer is a scikit-learn class that accepts categorical data as input and returns a NumPy array. Unlike LabelEncoder, it encodes the data into dummy variables indicating the presence or absence of each particular label, which makes it useful for encoding a column of categorical data.
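A minimal sketch (the category values are invented):

from sklearn.preprocessing import LabelBinarizer

colors = ['red', 'green', 'blue', 'green']

lb = LabelBinarizer()
print(lb.fit_transform(colors))
# [[0 0 1]
#  [0 1 0]
#  [1 0 0]
#  [0 1 0]]
print(lb.classes_)  # ['blue' 'green' 'red']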


1 Answer

Your intuition that you should add the MultiLabelBinarizer to the pipeline was the right way to solve this problem. It would have worked, except that MultiLabelBinarizer.fit_transform does not have the fit_transform(self, X, y=None) signature that is now standard for sklearn estimators. Instead, it has a unique fit_transform(self, y) signature, which I had never noticed before. As a result of this difference, when you call fit on the pipeline, it tries to pass training_tags as a third positional argument to a function that accepts only two, which doesn't work.
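To make the mismatch concrete, here is a small reproduction (the labels are invented):

from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()

# MultiLabelBinarizer's non-standard signature takes the labels only.
mlb.fit_transform([['a'], ['a', 'b']])  # works

# Pipeline.fit effectively calls fit_transform(X, y) on each step,
# which is one positional argument too many here:
# mlb.fit_transform(X, y)
# TypeError: fit_transform() takes 2 positional arguments but 3 were given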

The solution to this problem is tricky. The cleanest way I can think of to work around it is to create your own MultiLabelBinarizer subclass that overrides fit_transform and ignores the extra y argument. Try something like the following.

from sklearn.preprocessing import MultiLabelBinarizer

class MyMLB(MultiLabelBinarizer):
    def fit_transform(self, X, y=None):
        # Accept the standard (X, y) signature but ignore y.
        return super(MyMLB, self).fit_transform(X)

Try adding this to your pipeline in place of the MultiLabelBinarizer and see what happens. If you're able to fit() the pipeline, the last problem that you'll have is that your new MyMLB class has to be importable on any system that will de-pickle your now trained, pickled pipeline object. The easiest way to do this is to put MyMLB into its own module and place a copy on the remote machine that will be de-pickling and executing the model. That should fix it.
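For example, one hypothetical layout (the module name my_mlb.py is my invention):

# my_mlb.py -- copy this same file to both the training and serving machines.
from sklearn.preprocessing import MultiLabelBinarizer

class MyMLB(MultiLabelBinarizer):
    def fit_transform(self, X, y=None):
        return super(MyMLB, self).fit_transform(X)

Both components then do from my_mlb import MyMLB; pickle records the class by its module path (my_mlb.MyMLB), so that module must be importable wherever the pipeline is unpickled.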

Update: I misunderstood how the MultiLabelBinarizer worked. It is a transformer of outputs, not of inputs. Not only does this explain the alternative fit_transform() method signature for that class, it also makes it fundamentally incompatible with inclusion in a single classification pipeline, which is limited to transforming inputs and making predictions of outputs. However, all is not lost!

Based on your question, you're already comfortable with serializing your model to disk as [some form of] a .pkl file. You should be able to also serialize a trained MultiLabelBinarizer, then unpack it and use it to decode the outputs of your pipeline. I know you're using joblib, but I'll write up this sample code as if you're using pickle; the idea still applies.

import pickle

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer

X = <training_data>
y = <training_labels>

# Binarize the multi-label targets outside the pipeline.
mlb = MultiLabelBinarizer()
multilabel_y = mlb.fit_transform(y)

p = Pipeline([
    ('vect', CountVectorizer(min_df=min_df, ngram_range=ngram_range)),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(clf))
])

# Use the binarized labels to fit the pipeline.
p.fit(X, multilabel_y)

# Serialize both the pipeline and binarizer to disk.
with open('my_sklearn_objects.pkl', 'wb') as f:
    pickle.dump((mlb, p), f)
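Since you're already on joblib, the same pairing works there as well; a minimal sketch (same invented file name as above):

import joblib  # sklearn.externals.joblib on older sklearn versions

# joblib pickles arbitrary Python objects, so the (binarizer, pipeline)
# tuple round-trips just like it does with pickle.
joblib.dump((mlb, p), 'my_sklearn_objects.pkl')
mlb, p = joblib.load('my_sklearn_objects.pkl')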

Then, after shipping the .pkl file to the remote box...

import pickle

# Hydrate the serialized objects.
with open('my_sklearn_objects.pkl', 'rb') as f:
    mlb, p = pickle.load(f)

X = <input data>  # Get your input data from somewhere.

# Predict the binarized classes using the pipeline.
mlb_predictions = p.predict(X)

# Turn those classes back into text labels using the binarizer.
classes = mlb.inverse_transform(mlb_predictions)

# Do something with the predicted labels.
<...>

Is this the paradigm for doing this? As far as I know, yes. Not only that, but if you want to keep them together (which is a good idea, I think), you can serialize them as a tuple as in the example above, so they stay in a single file. No need to serialize a custom object or anything like that.

Model serialization via pickle et al. is the sklearn-approved way to save estimators between runs and move them between computers. I've used this process successfully many times before, including in production systems.

answered Sep 28 '22 by rileymcdowell