
NotFittedError: TfidfVectorizer - Vocabulary wasn't fitted

I am trying to build a sentiment analyzer using scikit-learn/pandas. Building and evaluating the model works, but attempting to classify new sample text does not.

My code:

import csv
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score

infile = 'Sentiment_Analysis_Dataset.csv'
data = "SentimentText"
labels = "Sentiment"

class Classifier():
    def __init__(self):
        self.train_set, self.test_set = self.load_data()
        self.counts, self.test_counts = self.vectorize()
        self.classifier = self.train_model()

    def load_data(self):

        df = pd.read_csv(infile, header=0, error_bad_lines=False)
        train_set, test_set = train_test_split(df, test_size=.3)
        return train_set, test_set

    def train_model(self):
        classifier = BernoulliNB()
        targets = self.train_set[labels]
        classifier.fit(self.counts, targets)
        return classifier

    def vectorize(self):

        vectorizer = TfidfVectorizer(min_df=5,
                                 max_df = 0.8,
                                 ngram_range = (1,2))
        counts = vectorizer.fit_transform(self.train_set[data])
        test_counts = vectorizer.transform(self.test_set[data])

        return counts, test_counts

    def evaluate(self):
        test_counts,test_set = self.test_counts, self.test_set
        predictions = self.classifier.predict(test_counts)
        print (classification_report(test_set[labels], predictions))
        print ("The accuracy score is {:.2%}".format(accuracy_score(test_set[labels], predictions)))

    def classify(self, input):
        input_text = input

        input_vectorizer = TfidfVectorizer(min_df=5,
                                 max_df = 0.8,
                                 ngram_range = (1,2))
        input_counts = input_vectorizer.transform(input_text)
        predictions = self.classifier.predict(input_counts)

myModel = Classifier()

text = ['I like this I feel good about it', 'give me 5 dollars']


The error:

Traceback (most recent call last):
  File "sentiment.py", line 74, in <module>
  File "sentiment.py", line 66, in classify
    input_counts = input_vectorizer.transform(input_text)
  File "/home/rachel/Sentiment/ENV/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 1380, in transform
    X = super(TfidfVectorizer, self).transform(raw_documents)
  File "/home/rachel/Sentiment/ENV/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 890, in transform
  File "/home/rachel/Sentiment/ENV/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 278, in _check_vocabulary
    check_is_fitted(self, 'vocabulary_', msg=msg),
  File "/home/rachel/Sentiment/ENV/lib/python3.5/site-packages/sklearn/utils/validation.py", line 690, in check_is_fitted
    raise _NotFittedError(msg % {'name': type(estimator).__name__})
sklearn.exceptions.NotFittedError: TfidfVectorizer - Vocabulary wasn't fitted.

I'm not sure what the issue could be. In my classify method, I create a brand new vectorizer to process the text I want to classify, separate from the vectorizer used to create the training and test data for the model.


killer_manatee asked May 26 '17 03:05


1 Answer

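Save the fitted vectorizer as a pickle or joblib file and load it when you want to predict. A new TfidfVectorizer has no vocabulary until it is fitted, so classify has to reuse the vectorizer that was fitted on the training data instead of creating a fresh one.

import pickle

pickle.dump(vectorizer, open("vectorizer.pickle", "wb"))   # Save the fitted vectorizer
vectorizer = pickle.load(open("vectorizer.pickle", "rb"))  # Load it back before predicting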
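For illustration, here is a minimal sketch of the same idea end to end. The tiny training set and its labels are made up, and min_df/max_df are left out only because the toy corpus is too small for them; the names train_texts, new_texts and restored are hypothetical. It shows that an unfitted vectorizer raises NotFittedError, while the fitted one, round-tripped through pickle, can transform new text for the classifier.

```python
import pickle

from sklearn.exceptions import NotFittedError
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB

# Made-up toy data, only for illustration (not the asker's dataset).
train_texts = [
    "I like this I feel good about it",
    "what a great and happy day",
    "I hate this it is awful",
    "this is terrible and sad",
]
train_labels = [1, 1, 0, 0]

# Fit the vectorizer once, on the training text, and train on its output.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
counts = vectorizer.fit_transform(train_texts)
classifier = BernoulliNB().fit(counts, train_labels)

new_texts = ["I like this I feel good about it", "give me 5 dollars"]

# A brand new vectorizer has no vocabulary yet, so transform() fails.
try:
    TfidfVectorizer(ngram_range=(1, 2)).transform(new_texts)
    raised = False
except NotFittedError:
    raised = True

# Round-trip the fitted vectorizer through pickle (in memory here;
# pickle.dump/pickle.load on a file behave the same way), then reuse it.
restored = pickle.loads(pickle.dumps(vectorizer))
predictions = classifier.predict(restored.transform(new_texts))

print("unfitted vectorizer raised NotFittedError:", raised)
print("predictions:", predictions)
```

In the question's class, the equivalent change would be to keep the fitted vectorizer around (for example as an attribute set in vectorize) and call its transform in classify, rather than building a new TfidfVectorizer there.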
nr spider answered Oct 15 '22 13:10