Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How do I store a TfidfVectorizer for future use in scikit-learn?

Tags:

I have a TfidfVectorizer that vectorizes collection of articles followed by feature selection.

vectroizer = TfidfVectorizer()
X_train = vectroizer.fit_transform(corpus)
selector = SelectKBest(chi2, k = 5000 )
X_train_sel = selector.fit_transform(X_train, y_train)

Now, I want to store this and use it in other programs. I don't want to re-run the TfidfVectorizer() and the feature selector on the training dataset. How do I do that? I know how to make a model persistent using joblib but I wonder if this is the same as making a model persistent.

like image 323
user2161903 Avatar asked Sep 24 '15 15:09

user2161903


2 Answers

You can simply use the built in pickle library:

import pickle
pickle.dump(vectorizer, open("vectorizer.pickle", "wb"))
pickle.dump(selector, open("selector.pickle", "wb"))

and load it with:

vectorizer = pickle.load(open("vectorizer.pickle", "rb"))
selector = pickle.load(open("selector.pickle", "rb"))

Pickle will serialize the objects to disk and load them in memory again when you need it

pickle lib docs

like image 99
Marco Ferragina Avatar answered Sep 21 '22 16:09

Marco Ferragina


Here is my answer using joblib:

import joblib
joblib.dump(vectorizer, 'vectorizer.pkl')
joblib.dump(selector, 'selector.pkl')

Later, I can load it and ready to go:

vectorizer = joblib.load('vectorizer.pkl')
selector = joblib.load('selector.pkl')

test = selector.trasnform(vectorizer.transform(['this is test']))
like image 24
user2161903 Avatar answered Sep 17 '22 16:09

user2161903