I have a TfidfVectorizer that vectorizes a collection of articles, followed by feature selection:
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(corpus)
selector = SelectKBest(chi2, k=5000)
X_train_sel = selector.fit_transform(X_train, y_train)
Now I want to store the fitted vectorizer and selector and use them in other programs, without re-running TfidfVectorizer and the feature selector on the training dataset. How do I do that? I know how to make a model persistent using joblib, but I wonder whether this is the same as making a model persistent.
You can simply use the built-in pickle library:
import pickle
with open("vectorizer.pickle", "wb") as f:
    pickle.dump(vectorizer, f)
with open("selector.pickle", "wb") as f:
    pickle.dump(selector, f)
and load them with:
with open("vectorizer.pickle", "rb") as f:
    vectorizer = pickle.load(f)
with open("selector.pickle", "rb") as f:
    selector = pickle.load(f)
Pickle will serialize the objects to disk and load them back into memory when you need them.
pickle library docs
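To confirm that persisting a fitted transformer really behaves like model persistence, here is a minimal round-trip sketch. The toy corpus, labels, and k=4 are stand-in assumptions, and it uses pickle.dumps/loads in memory rather than files for brevity:

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# Toy stand-ins for the real training data
corpus = ["spam spam eggs", "ham eggs toast", "spam offer now", "breakfast ham eggs"]
y_train = [1, 0, 1, 0]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(corpus)
selector = SelectKBest(chi2, k=4).fit(X_train, y_train)

# Round-trip through pickle: the restored objects keep their fitted state
restored_vec = pickle.loads(pickle.dumps(vectorizer))
restored_sel = pickle.loads(pickle.dumps(selector))

new_docs = ["spam and eggs"]
original = selector.transform(vectorizer.transform(new_docs))
restored = restored_sel.transform(restored_vec.transform(new_docs))
assert (original != restored).nnz == 0  # identical sparse outputs
```

The restored objects transform unseen text exactly as the originals do, so nothing needs to be re-fitted.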
Here is my answer using joblib:
import joblib
joblib.dump(vectorizer, 'vectorizer.pkl')
joblib.dump(selector, 'selector.pkl')
Later, I can load them and be ready to go:
vectorizer = joblib.load('vectorizer.pkl')
selector = joblib.load('selector.pkl')
test = selector.transform(vectorizer.transform(['this is a test']))
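As a variation on the above (a sketch, not part of the original answer), the vectorizer and selector can be chained in a scikit-learn Pipeline so that only a single object needs to be persisted and loaded. The toy corpus, labels, and k=5 below are assumptions for illustration:

```python
import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline

# Toy stand-ins for the real training data
corpus = ["good article about sports", "bad article about politics",
          "great sports coverage", "terrible political news"]
y_train = [1, 0, 1, 0]

# Chain vectorizer and selector so one object captures the whole transform
pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("select", SelectKBest(chi2, k=5)),  # k=5 suits the tiny toy vocabulary
])
X_train_sel = pipe.fit_transform(corpus, y_train)

# Persist the whole pipeline in one file
joblib.dump(pipe, "pipeline.pkl")

# Later, in another program:
pipe = joblib.load("pipeline.pkl")
features = pipe.transform(["this is a test"])
```

This avoids having to keep two pickle files in sync, and the loaded pipeline applies both steps with a single transform call.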