Save model for later prediction (OneVsRest)

Question

I would like to know how to save a OneVsRest classifier model for later prediction.

I am having trouble saving it, because doing so also means saving the vectorizer, as I learnt in this post.

Here's the model I have created:

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(strip_accents='unicode', analyzer='word', ngram_range=(1,3), norm='l2')
# Fit on the training text only: a second fit() call would replace the vocabulary learned here
vectorizer.fit(train_text)

x_train = vectorizer.transform(train_text)
y_train = train.drop(labels = ['id','comment_text'], axis=1)

x_test = vectorizer.transform(test_text)
y_test = test.drop(labels = ['id','comment_text'], axis=1)


from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
from sklearn.multiclass import OneVsRestClassifier

%%time

# Using pipeline for applying logistic regression and one vs rest classifier
LogReg_pipeline = Pipeline([
                ('clf', OneVsRestClassifier(LogisticRegression(solver='sag'), n_jobs=-1)),
            ])

for category in categories:
    printmd('**Processing {} comments...**'.format(category))

    # Training logistic regression model on train data
    LogReg_pipeline.fit(x_train, train[category])

    # calculating test accuracy
    prediction = LogReg_pipeline.predict(x_test)
    print('Test accuracy is {}'.format(accuracy_score(test[category], prediction)))
    print("\n")

Any help will be very much appreciated.

Sincerely,

Luca Massaron · Accepted Answer

With joblib you can save any scikit-learn Pipeline together with all of its steps, including the fitted TfidfVectorizer.

Here I have rewritten your example using the first 200 examples of the 20 Newsgroups dataset:

from sklearn.datasets import fetch_20newsgroups
data = fetch_20newsgroups()

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
from sklearn.multiclass import OneVsRestClassifier

vectorizer = TfidfVectorizer(strip_accents='unicode', analyzer='word', ngram_range=(1,3), norm='l2')

x_train = data.data[:100]
y_train = data.target[:100]

x_test =  data.data[100:200]
y_test = data.target[100:200]

# Using pipeline for applying logistic regression and one vs rest classifier
LogReg_pipeline = Pipeline([
    ('vectorizer', vectorizer),
    ('clf', OneVsRestClassifier(LogisticRegression(solver='sag', 
                                                   class_weight='balanced'), 
                                n_jobs=-1))
                           ])

# Training logistic regression model on train data
LogReg_pipeline.fit(x_train, y_train)

In the code above, you first define the training and test data and instantiate the TfidfVectorizer. You then define a pipeline containing both the vectorizer and the OVR classifier, and fit it to the training data. It learns to predict all the classes at once.

Now you simply save the entire fitted pipeline as if it were a single predictor, using joblib:
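To make the role of the pipeline concrete, here is a minimal sketch, assuming a tiny made-up corpus and hypothetical labels (neither comes from the question's data). It suggests that calling predict on the pipeline with raw text gives the same result as vectorizing by hand and then calling the classifier, and that the OVR wrapper holds one binary estimator per class.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy data, for illustration only
texts = [
    "the team won the match", "the striker scored a late goal",
    "the new laptop has a fast processor", "a software update fixes the bug",
    "bake the bread for forty minutes", "add salt and pepper to the soup",
]
labels = ["sport", "sport", "tech", "tech", "food", "food"]

pipeline = Pipeline([
    ("vectorizer", TfidfVectorizer()),
    ("clf", OneVsRestClassifier(LogisticRegression())),
])
pipeline.fit(texts, labels)

# The fitted steps remain accessible by the names given in the pipeline
vec = pipeline.named_steps["vectorizer"]
clf = pipeline.named_steps["clf"]

raw = ["the goalkeeper saved the match", "a faster processor for the laptop"]
manual = clf.predict(vec.transform(raw))   # vectorize by hand, then classify
direct = pipeline.predict(raw)             # let the pipeline do both steps

print((manual == direct).all())
print(len(clf.estimators_))                # one binary classifier per class
```

Because the vectorizer is a step of the pipeline, there is no separate object to keep in sync with the classifier when the model is stored or reused.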

from joblib import dump, load
dump(LogReg_pipeline, 'LogReg_pipeline.joblib')

Your entire model is now saved to disk under the name 'LogReg_pipeline.joblib'. You can reload it and use it directly on raw data with this snippet:

clf = load('LogReg_pipeline.joblib') 
clf.predict(x_test)

You will get the predictions on the raw text because the pipeline will vectorize it automatically.
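For a multilabel target like the one in the question, the same save and reload round trip should apply. The sketch below assumes the labels are given as a binary indicator matrix (one column per label), which OneVsRestClassifier accepts; the corpus, the two label columns and the file name are hypothetical, and a temporary directory is used only to keep the example self-contained.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy comments with two label columns, e.g. ("toxic", "insult")
texts = [
    "you are a terrible person", "what an idiot you are",
    "this is awful and you are stupid", "thanks for the helpful answer",
    "great explanation, well done", "I disagree but it is a fair point",
]
y = np.array([
    [1, 0],
    [1, 1],
    [1, 1],
    [0, 0],
    [0, 0],
    [0, 0],
])

pipeline = Pipeline([
    ("vectorizer", TfidfVectorizer()),
    ("clf", OneVsRestClassifier(LogisticRegression())),
])
pipeline.fit(texts, y)  # a single fit covers every label column

new_texts = ["you are an idiot", "thanks, great answer"]
before = pipeline.predict(new_texts)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "pipeline.joblib")  # assumed file name
    dump(pipeline, path)      # vectorizer and classifier are stored together
    restored = load(path)

after = restored.predict(new_texts)
print(np.array_equal(before, after))
print(after.shape)  # (n_samples, n_labels)
```

Two practical caveats: joblib files are pickle-based, so only load files from a source you trust, and reload them with the same scikit-learn version that was used to save them where possible.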

Tags: python, save, scikit-learn, multilabel-classification, tfidfvectorizer
