Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Save model for later prediction (OneVsRest)

I would like to know how to save OnevsRest classifier model for later prediciton.

I have an issue saving it, since it implies saving the vectorizer as well. I have learnt in this post.

Here's the model I have created:

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(strip_accents='unicode', analyzer='word', ngram_range=(1,3), norm='l2')
vectorizer.fit(train_text)
vectorizer.fit(test_text)

x_train = vectorizer.transform(train_text)
y_train = train.drop(labels = ['id','comment_text'], axis=1)

x_test = vectorizer.transform(test_text)
y_test = test.drop(labels = ['id','comment_text'], axis=1)


from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
from sklearn.multiclass import OneVsRestClassifier

%%time

# Using pipeline for applying logistic regression and one vs rest classifier
LogReg_pipeline = Pipeline([
                ('clf', OneVsRestClassifier(LogisticRegression(solver='sag'), n_jobs=-1)),
            ])

for category in categories:
    printmd('**Processing {} comments...**'.format(category))

    # Training logistic regression model on train data
    LogReg_pipeline.fit(x_train, train[category])

    # calculating test accuracy
    prediction = LogReg_pipeline.predict(x_test)
    print('Test accuracy is {}'.format(accuracy_score(test[category], prediction)))
    print("\n") 

Any help will be very much appreciated.

Sincerely,


1 Answers

Using joblib you can save any Scikit-learn Pipeline complete of all its elements, therefore comprising also the fitted TfidfVectorizer.

Here I have rewritten your example using the first 200 examples of the Newsgroups20 dataset:

from sklearn.datasets import fetch_20newsgroups
data = fetch_20newsgroups()

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
from sklearn.multiclass import OneVsRestClassifier

vectorizer = TfidfVectorizer(strip_accents='unicode', analyzer='word', ngram_range=(1,3), norm='l2')

x_train = data.data[:100]
y_train = data.target[:100]

x_test =  data.data[100:200]
y_test = data.target[100:200]

# Using pipeline for applying logistic regression and one vs rest classifier
LogReg_pipeline = Pipeline([
    ('vectorizer', vectorizer),
    ('clf', OneVsRestClassifier(LogisticRegression(solver='sag', 
                                                   class_weight='balanced'), 
                                n_jobs=-1))
                           ])

# Training logistic regression model on train data
LogReg_pipeline.fit(x_train, y_train)

In the above code you simply start defining your train and test data and you instantiate your TfidfVectorizer. You then define your pipeline comprising both the vectorizer and the OVR classifier and you fit it to the training data. It will learn to predict all the classes at once.

Now you simply save the entire fitted pipeline as it were a single predictor using joblib:

from joblib import dump, load
dump(LogReg_pipeline, 'LogReg_pipeline.joblib') 

Your entire model is not saved to disk under the name 'LogReg_pipeline.joblib'. You can recall it and use it directly on raw data by this code snippet:

clf = load('LogReg_pipeline.joblib') 
clf.predict(x_test)

You will get the predictions on the raw text because the pipeline will vectorize it automatically.

like image 140
Luca Massaron Avatar answered Oct 20 '25 07:10

Luca Massaron



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!