I trained a classifier on a set of short documents and pickled it after getting reasonable F1 and accuracy scores on a binary classification task.
While training, I reduced the number of features using a scikit-learn CountVectorizer cv:
    cv = CountVectorizer(min_df=1, ngram_range=(1, 3), max_features = 15000) 
and then used the fit_transform() and transform() methods to obtain the transformed train and test sets:
    transformed_feat_train = cv.fit_transform(trainingTextFeat).toarray()
    transformed_feat_test = cv.transform(testingTextFeat).toarray()
This all worked fine for training and testing the classifier. However, I am not sure how to use fit_transform() and transform() with a pickled version of the trained classifier for predicting the label of unseen, unlabeled data.
I am extracting the features from the unlabeled data exactly the same way as I did when training/testing the classifier:
    ## load the pickled classifier for labeling
    pickledClassifier = joblib.load(pickledClassifierFile)
    ## transform data
    cv = CountVectorizer(min_df=1, ngram_range=(1, 3), max_features = 15000)
    cv.fit_transform(NOT_SURE)  # not sure what to pass here
    transformed_feat_unlabeled = cv.transform(unlabeled_text_feat).toarray()
    ## predict label on unseen, unlabeled data
    l_predLabel = pickledClassifier.predict(transformed_feat_unlabeled)
Error message:
    Traceback (most recent call last):
      File "../clf.py", line 615, in <module>
        if __name__=="__main__": main()
      File "../clf.py", line 579, in main
        cv.fit_transform(pickledClassifierFile)
      File "../sklearn/feature_extraction/text.py", line 780, in fit_transform
        vocabulary, X = self._count_vocab(raw_documents, self.fixed_vocabulary)
      File "../sklearn/feature_extraction/text.py", line 727, in _count_vocab
        raise ValueError("empty vocabulary; perhaps the documents only"
    ValueError: empty vocabulary; perhaps the documents only contain stop words
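You should use the same vectorizer instance for transforming the training data, the test data, and any unlabeled data you predict on later. A new CountVectorizer fitted at prediction time learns a different vocabulary, so its columns no longer match the features the classifier was trained on. To do that, create a pipeline with the vectorizer + classifier, train the pipeline on the training set, and pickle the whole pipeline. Later, load the pickled pipeline and call predict on it.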
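The pipeline approach could look roughly like the sketch below. It is only an illustration under assumptions: the toy texts and labels, the file name, and the choice of LogisticRegression are all hypothetical stand-ins, since the question does not say which classifier or data were used. The CountVectorizer settings are copied from the question.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for trainingTextFeat and its labels.
train_texts = [
    "cheap pills buy now",
    "limited offer buy cheap",
    "meeting agenda for monday",
    "please review the project agenda",
]
train_labels = [1, 1, 0, 0]

# Vectorizer and classifier live in one object, so the fitted vocabulary
# is saved together with the trained model.
pipeline = Pipeline([
    ("cv", CountVectorizer(min_df=1, ngram_range=(1, 3), max_features=15000)),
    ("clf", LogisticRegression()),  # assumed classifier; swap in your own
])
pipeline.fit(train_texts, train_labels)

# Persist the whole fitted pipeline (hypothetical file name).
model_path = os.path.join(tempfile.mkdtemp(), "text_clf_pipeline.joblib")
joblib.dump(pipeline, model_path)

# Later, possibly in another script: load the pipeline and predict on raw text.
# No fit or fit_transform call is needed here; predict() applies the stored
# vectorizer's transform() internally.
loaded_pipeline = joblib.load(model_path)
unlabeled_texts = ["buy cheap pills", "agenda for the meeting"]
predicted_labels = loaded_pipeline.predict(unlabeled_texts)
print(predicted_labels)
```

Note that the pipeline takes the raw strings directly and passes the sparse count matrix to the classifier, so the .toarray() calls from the question are not needed with this (assumed) classifier. If you would rather keep the two objects separate, dumping the fitted cv with joblib next to the classifier and calling only cv.transform() on the unlabeled text should work the same way.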
See this related question: Bringing a classifier to production.