I couldn't find the answer to this online, but is the result of TfidfVectorizer.fit_transform a matrix whose maximum value is 1.0?
Because, with

idf(term_i) = log(total number of documents / number of documents containing term_i),

shouldn't idf, and consequently tf-idf, be > 1.0 in many cases?
For example, take documents containing the word 'absinthe'. Say our term frequency (tf) is 1, and idf = log(1000 total documents / 1 document containing 'absinthe') = log(1000) = 3 (base 10), so tf-idf = 1 * 3 = 3, which is well above 1, no?
But in my case, using scikit-learn's TfidfVectorizer, the maximum value I get seems to be 1. Is the output normalized?
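Here is a minimal sketch of what I mean (toy corpus, default settings):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat", "the dog sat", "absinthe"]
X = TfidfVectorizer().fit_transform(docs)
print(X.max())  # always comes out <= 1.0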
By default, the tf-idf rows are L2-normalized. Here are the critical lines in the source code:
if self.norm:
    X = normalize(X, norm=self.norm, copy=False)
normalize() comes from the sklearn.preprocessing module; its documentation states that it normalizes the rows (samples) by default. Here is the link to the normalize() docs.
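To see the effect, here is a quick sketch with an assumed toy corpus: with the default norm='l2' every row is scaled to unit Euclidean length, so no entry can exceed 1.0, while passing norm=None keeps the raw tf-idf weights, which can be well above 1.

from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

docs = ["absinthe", "the cat sat on the mat", "the dog sat"]

X_l2 = TfidfVectorizer(norm='l2').fit_transform(docs)   # default behaviour
X_raw = TfidfVectorizer(norm=None).fit_transform(docs)  # normalization disabled

print(X_l2.max())                              # <= 1.0
print(np.linalg.norm(X_l2.toarray(), axis=1))  # every row has length ~1.0
print(X_raw.max())                             # can be greater than 1.0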