I would like to know how to save a OneVsRest classifier model for later prediction.
I am having trouble saving it, since that implies saving the vectorizer as well, as I learnt in this post.
Here's the model I have created:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(strip_accents='unicode', analyzer='word', ngram_range=(1,3), norm='l2')
# Fit on the training text only: a second fit() on test_text would discard the vocabulary learnt here
vectorizer.fit(train_text)
x_train = vectorizer.transform(train_text)
y_train = train.drop(labels = ['id','comment_text'], axis=1)
x_test = vectorizer.transform(test_text)
y_test = test.drop(labels = ['id','comment_text'], axis=1)
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
from sklearn.multiclass import OneVsRestClassifier
%%time
# Using pipeline for applying logistic regression and one vs rest classifier
LogReg_pipeline = Pipeline([
    ('clf', OneVsRestClassifier(LogisticRegression(solver='sag'), n_jobs=-1)),
])
for category in categories:
    printmd('**Processing {} comments...**'.format(category))
    # Training logistic regression model on train data
    LogReg_pipeline.fit(x_train, train[category])
    # calculating test accuracy
    prediction = LogReg_pipeline.predict(x_test)
    print('Test accuracy is {}'.format(accuracy_score(test[category], prediction)))
    print("\n")
Any help will be very much appreciated.
Sincerely,
Using joblib you can save any scikit-learn Pipeline complete with all its elements, including the fitted TfidfVectorizer.
Here I have rewritten your example using the first 200 examples of the 20 Newsgroups dataset:
from sklearn.datasets import fetch_20newsgroups
data = fetch_20newsgroups()
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
from sklearn.multiclass import OneVsRestClassifier
vectorizer = TfidfVectorizer(strip_accents='unicode', analyzer='word', ngram_range=(1,3), norm='l2')
x_train = data.data[:100]
y_train = data.target[:100]
x_test = data.data[100:200]
y_test = data.target[100:200]
# Using pipeline for applying logistic regression and one vs rest classifier
LogReg_pipeline = Pipeline([
    ('vectorizer', vectorizer),
    ('clf', OneVsRestClassifier(LogisticRegression(solver='sag',
                                                   class_weight='balanced'),
                                n_jobs=-1))
])
# Training logistic regression model on train data
LogReg_pipeline.fit(x_train, y_train)
In the above code you start by defining your train and test data and instantiating your TfidfVectorizer. You then define a pipeline comprising both the vectorizer and the OVR classifier, and you fit it to the raw training text. It will learn to predict all the classes at once.
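If your labels are several 0/1 columns, as in the question, the same idea should carry over: OneVsRestClassifier also accepts a label indicator matrix as the target, so one pipeline can replace the per-category loop. Here is a minimal sketch, assuming a made-up toy corpus and two hypothetical categories ("sports" and "tech") that are not taken from your data:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy texts and categories, for illustration only
texts = [
    "the team won the football match",
    "the new phone has a faster processor",
    "the stadium app streams the match to your phone",
    "a quiet walk in the park",
    "the striker scored twice in the final",
    "this laptop ships with more memory",
]
categories = ["sports", "tech"]

# One row per text, one 0/1 column per category (a text can have several labels)
y = np.array([
    [1, 0],
    [0, 1],
    [1, 1],
    [0, 0],
    [1, 0],
    [0, 1],
])

multilabel_pipeline = Pipeline([
    ("vectorizer", TfidfVectorizer()),
    ("clf", OneVsRestClassifier(LogisticRegression())),
])

# A single fit trains one binary classifier per column of y
multilabel_pipeline.fit(texts, y)

# The result has one row per input text and one column per category
predictions = multilabel_pipeline.predict(["the match was great", "my phone is slow"])
print(predictions.shape)
```

With real data you would pass your label columns (for example the DataFrame left after dropping 'id' and 'comment_text') in place of the toy matrix y.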
Now you simply save the entire fitted pipeline as if it were a single predictor using joblib:
from joblib import dump, load
dump(LogReg_pipeline, 'LogReg_pipeline.joblib')
Your entire model is now saved to disk under the name 'LogReg_pipeline.joblib'. You can load it back and use it directly on raw data with this code snippet:
clf = load('LogReg_pipeline.joblib')
clf.predict(x_test)
You will get the predictions on the raw text because the pipeline will vectorize it automatically.
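To check the round trip yourself, you can compare the predictions before and after reloading. The following is a small sketch under stated assumptions: the toy corpus, the labels and the file name 'toy_pipeline.joblib' are made up for illustration, and a temporary directory is used so nothing is left on disk:

```python
import os
import tempfile

from joblib import dump, load
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus with three classes, used so the sketch runs without a download
texts = [
    "the team won the football match",
    "the striker scored twice in the final",
    "the new phone has a faster processor",
    "this laptop ships with more memory",
    "the recipe needs two eggs and flour",
    "bake the bread for forty minutes",
]
labels = [0, 0, 1, 1, 2, 2]

pipeline = Pipeline([
    ("vectorizer", TfidfVectorizer()),
    ("clf", OneVsRestClassifier(LogisticRegression())),
])
pipeline.fit(texts, labels)

# Save the fitted pipeline (vectorizer + classifier) and load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_pipeline.joblib")
    dump(pipeline, path)
    restored = load(path)

# Both objects receive raw text; the vectorizer step runs inside predict()
new_texts = ["the goalkeeper saved a penalty", "the processor runs hot"]
original_pred = pipeline.predict(new_texts)
restored_pred = restored.predict(new_texts)

print((original_pred == restored_pred).all())
```

Keep in mind that a joblib file is a pickle under the hood, so it is safest to load it with the same scikit-learn version that produced it, and only from sources you trust.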