
How does TfidfVectorizer compute scores on test data

In scikit-learn, TfidfVectorizer lets us fit on training data and later use the same fitted vectorizer to transform our test data. The output of transforming the training data is a matrix in which each entry is the tf-idf score of a word in a given document.

However, how does the fitted vectorizer compute the score for new inputs? I have guessed that either:

  1. The score of a word in a new document is computed by some aggregation of that word's scores over the documents in the training set.
  2. The new document is 'added' to the existing corpus and new scores are calculated.

I have tried deducing the operation from scikit-learn's source code but could not quite figure it out. Is it one of the options I've previously mentioned or something else entirely? Please assist.

asked Apr 16 '19 by Yuval Cohen


People also ask

How do I get my TF-IDF score?

As its name implies, TF-IDF vectorizes/scores a word by multiplying the word's Term Frequency (TF) with the Inverse Document Frequency (IDF). Term Frequency: TF of a term or word is the number of times the term appears in a document compared to the total number of words in the document.

How is TF-IDF Sklearn calculated?

The formula used to compute the tf-idf for a term t of a document d in a document set is tf-idf(t, d) = tf(t, d) * idf(t), where the idf is computed as idf(t) = log [ n / df(t) ] + 1 (if smooth_idf=False), n is the total number of documents in the document set, and df(t) is the document frequency of t; the resulting tf-idf vectors are then normalized by the Euclidean norm.
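To make the formula concrete, here is a minimal sketch that recomputes both idf variants with NumPy; the toy corpus, its document-frequency counts, and the variable names are purely illustrative:

import numpy as np

# toy corpus: n = 2 documents (illustrative only)
n = 2

# document frequency of each term in that corpus
df = {"we": 2, "love": 2, "apples": 1, "bananas": 1, "really": 1}

# smooth_idf=True (the default): idf(t) = ln((1 + n) / (1 + df(t))) + 1
idf_smooth = {t: np.log((1 + n) / (1 + d)) + 1 for t, d in df.items()}

# smooth_idf=False: idf(t) = ln(n / df(t)) + 1
idf_raw = {t: np.log(n / d) + 1 for t, d in df.items()}

print(idf_smooth["really"])  # ln(3/2) + 1 ~ 1.405
print(idf_raw["really"])     # ln(2/1) + 1 ~ 1.693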

What does TfidfVectorizer return?

TfidfVectorizer transforms text into feature vectors that can be used as input to an estimator. Its vocabulary_ attribute is a dictionary that maps each token (word) to a feature index in the matrix; each unique token gets its own column index.
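A minimal sketch of that attribute, using a made-up two-document corpus:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["we love apples", "we really love bananas"]  # illustrative corpus

vect = TfidfVectorizer()
X = vect.fit_transform(docs)  # sparse matrix of shape (2, 5)

# maps each token to its column index in X (indices follow alphabetical order)
print(vect.vocabulary_)
# {'we': 4, 'love': 2, 'apples': 0, 'really': 3, 'bananas': 1}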

What does TF-IDF transform do?

TF-IDF (Term Frequency - Inverse Document Frequency) is a handy algorithm that uses the frequency of words to determine how relevant those words are to a given document. It's a relatively simple but intuitive approach to weighting words, allowing it to act as a great jumping off point for a variety of tasks.

When to use tfidftransformer vs tfidfvectorizer?

Here is a general guideline:

  1. If you need the term frequency (term count) vectors for different tasks, use TfidfTransformer.
  2. If you need to compute tf-idf scores on documents within your “training” dataset, use TfidfVectorizer.
  3. If you need to compute tf-idf scores on documents outside your “training” dataset, use either one; both will work.
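A small sketch of the equivalence mentioned in point 3, assuming default parameters and an illustrative two-document corpus: CountVectorizer followed by TfidfTransformer should produce the same matrix as TfidfVectorizer alone.

import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer,
)

docs = ["we love apples", "we really love bananas"]  # illustrative corpus

# Route 1: term counts first, then tf-idf weighting
counts = CountVectorizer().fit_transform(docs)
tfidf_a = TfidfTransformer().fit_transform(counts)

# Route 2: TfidfVectorizer does both steps in one object
tfidf_b = TfidfVectorizer().fit_transform(docs)

print(np.allclose(tfidf_a.toarray(), tfidf_b.toarray()))  # True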

How do I get the tf-idf scores for my Docs?

Then, by invoking tfidf_transformer.transform(count_vector), you finally compute the tf-idf scores for your docs. Internally this performs the tf * idf multiplication, where each term frequency is weighted by its idf value. Now, let's print the tf-idf values of the first document to see if they make sense.
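A minimal sketch of that step; the corpus is made up, but the tfidf_transformer and count_vector names mirror the ones in the snippet above:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = ["we love apples", "we really love bananas"]  # illustrative corpus

cv = CountVectorizer()
count_vector = cv.fit_transform(docs)

tfidf_transformer = TfidfTransformer()
tfidf_transformer.fit(count_vector)
tfidf_vectors = tfidf_transformer.transform(count_vector)

# tf-idf scores of the first document, highest first
scores = tfidf_vectors[0].toarray().ravel()
for term, score in sorted(zip(cv.get_feature_names_out(), scores),
                          key=lambda pair: pair[1], reverse=True):
    print(f"{term}: {score:.3f}")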

Is it possible to train TFIDF on the test corpus?

When training a model, the tf-idf weights can be fitted on the training corpus only or on the training and test corpora together. Including the test corpus when fitting seems questionable, yet because tf-idf is unsupervised it is technically possible to fit it on the whole corpus. Which is better?

How to use tfidftransformer to count words?

To start using TfidfTransformer you first have to create a CountVectorizer to count the number of words (term frequency), limit your vocabulary size, apply stop words, and so on. A sketch of this step, including a check of the resulting shape, follows below.
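The documents and parameter values here are made up and purely illustrative:

from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are great",
]

# count term frequencies while dropping English stop words and
# capping the vocabulary size (both settings are illustrative)
cv = CountVectorizer(stop_words="english", max_features=10)
word_count_vector = cv.fit_transform(docs)

# one row per document, one column per retained vocabulary term
print(word_count_vector.shape)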


1 Answer

It is definitely the former: each word's idf (inverse document frequency) is calculated based on the training documents only. This makes sense because these values are precisely the ones computed when you call fit on your vectorizer. If the second option you describe were true, we would essentially refit the vectorizer each time, and we would also cause information leakage, as idf values from the test set would be used during model evaluation.

Beyond these purely conceptual explanations, you can also run the following code to convince yourself:

from sklearn.feature_extraction.text import TfidfVectorizer
vect = TfidfVectorizer()
x_train = ["We love apples", "We really love bananas"]
vect.fit(x_train)
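# note: in scikit-learn >= 1.0 this method is get_feature_names_out();
# get_feature_names() was removed in scikit-learn 1.2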
print(vect.get_feature_names())
>>> ['apples', 'bananas', 'love', 'really', 'we']

x_test = ["We really love pears"]

vectorized = vect.transform(x_test)
print(vectorized.toarray())
>>> array([[0.        , 0.        , 0.50154891, 0.70490949, 0.50154891]])

Following the reasoning of how the fit methodology works, you can recalculate these tfidf values yourself:

"apples" and "bananas" obviously have a tfidf score of 0 because they do not appear in x_test. "pears", on the other hand, does not exist in x_train and so will not even appear in the vectorization. Hence, only "love", "really" and "we" will have a tfidf score.

Scikit-learn implements tfidf as (log((1+n)/(1+df)) + 1) * f, where n is the number of documents in the training set (2 for us), df is the number of training documents in which the word appears, and f is the frequency count of the word in the test document. Hence:

import numpy as np

tfidf_love = (np.log((1+2)/(1+2))+1)*1
tfidf_really = (np.log((1+2)/(1+1))+1)*1
tfidf_we = (np.log((1+2)/(1+2))+1)*1

You then need to normalize these tfidf scores by the L2 norm of your document vector:

tfidf_non_scaled = np.array([tfidf_love,tfidf_really,tfidf_we])
tfidf_list = tfidf_non_scaled/sum(tfidf_non_scaled**2)**0.5

print(tfidf_list)
>>> [0.50154891 0.70490949 0.50154891]

You can see that indeed, we are getting the same values, which confirms the way scikit-learn implemented this methodology.
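As an additional check (continuing from the snippet above), the fitted idf weights can be read directly off the vectorizer; they depend only on x_train, which is why transform() on x_test simply reuses them:

print(vect.idf_)
>>> [1.40546511 1.40546511 1.         1.40546511 1.        ]

The columns are apples, bananas, love, really, we: the words appearing in both training documents get idf = ln((1+2)/(1+2)) + 1 = 1, while the rarer words get ln((1+2)/(1+1)) + 1 ≈ 1.405, matching the manual calculation above.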

answered Oct 13 '22 by MaximeKan