To create a machine learning model, I made a list of dictionaries and used scikit-learn's DictVectorizer to make a feature vector for each item. I then trained an SVM on part of the dataset and tested the model on the held-out test set (you know, the typical approach). Everything worked great, and now I want to deploy the model into the wild and see how it works on new, unlabeled, unseen data. How do I save the feature vector so that the new data will have the same size/features and work with the SVM model? For example, if I want to train on the presence of words:
[{
    'contains(the)': 'True',
    'contains(cat)': 'True',
    'contains(is)': 'True',
    'contains(hungry)': 'True'
}, ...]
I train with a list that contains thousands of variations of that sentence, each with a different animal. When I vectorize the list, the vectorizer accounts for every animal mentioned and creates an index in the vector for each one ('the', 'is', and 'hungry' don't change). Now, when I use the model on a new sentence, I want to predict on a single item:
[{
    'contains(the)': 'True',
    'contains(emu)': 'True',
    'contains(is)': 'True',
    'contains(hungry)': 'True'
}]
Without the original training set, when I run DictVectorizer on this alone it generates (1, 1, 1, 1). That is a couple thousand indices short of the original vectors used to train my model, so the SVM will not accept it. And even if the vector happened to have the right length (say, because the model was trained on one massive sentence), the features might not correspond to the original positions. How do I get new data to conform to the dimensions of the training vectors? There will never be more features than in the training set, but not all features are guaranteed to be present in new data.
Is there a way to use pickle to save the feature vector? Alternatively, one method I've considered would be to generate a dictionary that contains all the possible features with the value 'False', as sketched below. That would force new data into the proper vector size and only count the items actually present in the new data.
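A minimal sketch of that padding idea, assuming all_features is a list of every feature name seen during training (the names here are just the ones from the example above, and the approach is superseded by the Pipeline solution in the answer below):

# Hypothetical padding workaround: all_features would be collected from
# the training data; the names here are illustrative only.
all_features = ['contains(cat)', 'contains(emu)', 'contains(hungry)',
                'contains(is)', 'contains(the)']

def pad_sample(sample):
    # Start every feature at 'False', then overwrite with what is present.
    padded = {feature: 'False' for feature in all_features}
    padded.update(sample)
    return padded

new_item = pad_sample({'contains(the)': 'True', 'contains(emu)': 'True',
                       'contains(is)': 'True', 'contains(hungry)': 'True'})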
I feel like I may not have described the problem adequately, so if something isn't clear I will attempt to explain it better. Thank you in advance!
EDIT: Thanks to larsman's answer, the solution was pretty simple:
from sklearn.pipeline import Pipeline
from sklearn import svm
from sklearn.feature_extraction import DictVectorizer
from sklearn.externals import joblib  # needed for joblib.dump below

vec = DictVectorizer(sparse=False)
svm_clf = svm.SVC(kernel='linear')

# Chain the vectorizer and the classifier so they are fitted together.
vec_clf = Pipeline([('vectorizer', vec), ('svm', svm_clf)])

# X_Train is the list of feature dicts, Y_Train the matching labels.
vec_clf.fit(X_Train, Y_Train)

joblib.dump(vec_clf, 'vectorizer_and_SVM.pkl')
The pipeline AND the support vector machine are trained on the data together. Now any future code can unpickle the pipeline and get a feature vectorizer built right into the SVM, as shown below.
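For completeness, loading and using the saved pipeline might look like this (new_data is a hypothetical list of feature dicts in the same format as the training data):

from sklearn.externals import joblib

# Load the fitted vectorizer + SVM pipeline saved above.
vec_clf = joblib.load('vectorizer_and_SVM.pkl')

# new_data is a list of feature dicts such as the emu example above;
# the pipeline vectorizes and classifies it in one call.
predictions = vec_clf.predict(new_data)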
How do I get new data to conform to the dimensions of the training vectors?

By using the transform method instead of fit_transform. The latter learns a new vocabulary from the data set you feed it.
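A minimal sketch of the difference, assuming train_dicts and new_dicts are placeholder lists of feature dicts like the ones above:

from sklearn.feature_extraction import DictVectorizer

vec = DictVectorizer(sparse=False)

# fit_transform learns the vocabulary from train_dicts and vectorizes it.
X_train = vec.fit_transform(train_dicts)

# transform reuses that vocabulary: features unseen during training are
# dropped, and missing features become zero columns, so the widths match.
X_new = vec.transform(new_dicts)

assert X_train.shape[1] == X_new.shape[1]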
Is there a way to use pickle to save the feature vector?

Pickle the trained vectorizer. Even better, make a Pipeline of the vectorizer and the SVM and pickle that. You can use sklearn.externals.joblib.dump for efficient pickling.
(Aside: the vectorizer is faster if you pass it the boolean True rather than the string "True".)
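In other words, the example dicts from the question would become:

# Booleans are used directly as numeric values (True -> 1.0), whereas
# string values like 'True' are one-hot encoded into separate
# 'feature=value' columns, which costs extra work and extra features.
[{'contains(the)': True,
  'contains(cat)': True,
  'contains(is)': True,
  'contains(hungry)': True}]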