To create a machine learning model, I made a list of dictionaries and used scikit-learn's DictVectorizer to make a feature vector for each item. I then trained an SVM on part of the dataset and tested the model on the held-out test set (you know, the typical approach). Everything worked great, and now I want to deploy the model into the wild and see how it works on new, unlabeled, unseen data. How do I save the feature vector so that the new data will have the same size/features and work with the SVM model? For example, if I want to train on the presence of words:
[{
    'contains(the)': 'True',
    'contains(cat)': 'True',
    'contains(is)': 'True',
    'contains(hungry)': 'True'
}, ...]
I train with a list that contains thousands of variations of that sentence, each with a different animal. When I vectorize the list, the vectorizer accounts for every animal mentioned and creates an index in the vector for each one ('the', 'is', and 'hungry' don't change). Now, when I use the model on a new sentence, I want to predict on a single item:
[{
    'contains(the)': 'True',
    'contains(emu)': 'True',
    'contains(is)': 'True',
    'contains(hungry)': 'True'
}]
Without the original training set, when I run DictVectorizer on this alone it generates (1, 1, 1, 1). That is a couple thousand indices short of the original vectors used to train my model, so the SVM will not accept it. And even if the vector happened to have the right length (say, because the model was trained on one massive sentence), the features might not correspond to the original positions. How do I get new data to conform to the dimensions of the training vectors? There will never be more features than in the training set, but not all features are guaranteed to be present in new data.
Is there a way to use pickle to save the feature vector? Alternatively, one method I've considered would be to generate a dictionary that contains all the possible features with the value 'False', as sketched below. That would force new data into the proper vector size and only count the items actually present in the new data.
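A minimal sketch of that padding idea, assuming all_features is a list of every feature name seen during training (the names here are just the ones from the example above, and the approach is superseded by the Pipeline solution in the answer below):

# Hypothetical padding workaround: all_features would be collected from
# the training data; the names here are illustrative only.
all_features = ['contains(cat)', 'contains(emu)', 'contains(hungry)',
                'contains(is)', 'contains(the)']

def pad_sample(sample):
    # Start every feature at 'False', then overwrite with what is present.
    padded = {feature: 'False' for feature in all_features}
    padded.update(sample)
    return padded

new_item = pad_sample({'contains(the)': 'True', 'contains(emu)': 'True',
                       'contains(is)': 'True', 'contains(hungry)': 'True'})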
I feel like I may not have described the problem adequately, so if something isn't clear I will attempt to explain it better. Thank you in advance!
EDIT: Thanks to larsman's answer, the solution was pretty simple:
from sklearn.pipeline import Pipeline
from sklearn import svm
from sklearn.feature_extraction import DictVectorizer
from sklearn.externals import joblib  # needed for joblib.dump below

vec = DictVectorizer(sparse=False)
svm_clf = svm.SVC(kernel='linear')

# Chain the vectorizer and the classifier so they are fitted together.
vec_clf = Pipeline([('vectorizer', vec), ('svm', svm_clf)])

# X_Train is the list of feature dicts, Y_Train the matching labels.
vec_clf.fit(X_Train, Y_Train)

joblib.dump(vec_clf, 'vectorizer_and_SVM.pkl')
The pipeline AND the support vector machine are trained on the data together. Now any future code can unpickle the pipeline and get a feature vectorizer built right into the SVM, as shown below.
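For completeness, loading and using the saved pipeline might look like this (new_data is a hypothetical list of feature dicts in the same format as the training data):

from sklearn.externals import joblib

# Load the fitted vectorizer + SVM pipeline saved above.
vec_clf = joblib.load('vectorizer_and_SVM.pkl')

# new_data is a list of feature dicts such as the emu example above;
# the pipeline vectorizes and classifies it in one call.
predictions = vec_clf.predict(new_data)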
How do I get new data to conform to the dimensions of the training vectors?

By using the transform method instead of fit_transform. The latter learns a new vocabulary from the data set you feed it.
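A minimal sketch of the difference, assuming train_dicts and new_dicts are placeholder lists of feature dicts like the ones above:

from sklearn.feature_extraction import DictVectorizer

vec = DictVectorizer(sparse=False)

# fit_transform learns the vocabulary from train_dicts and vectorizes it.
X_train = vec.fit_transform(train_dicts)

# transform reuses that vocabulary: features unseen during training are
# dropped, and missing features become zero columns, so the widths match.
X_new = vec.transform(new_dicts)

assert X_train.shape[1] == X_new.shape[1]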
Is there a way to use pickle to save the feature vector?

Pickle the trained vectorizer. Even better, make a Pipeline of the vectorizer and the SVM and pickle that. You can use sklearn.externals.joblib.dump for efficient pickling.
(Aside: the vectorizer is faster if you pass it the boolean True rather than the string "True".)
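In other words, the example dicts from the question would become:

# Booleans are used directly as numeric values (True -> 1.0), whereas
# string values like 'True' are one-hot encoded into separate
# 'feature=value' columns, which costs extra work and extra features.
[{'contains(the)': True,
  'contains(cat)': True,
  'contains(is)': True,
  'contains(hungry)': True}]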