
Python scikit-learn's TfidfVectorizer - max of 1.0?

I couldn't find the answer to this online: does TfidfVectorizer.fit_transform return a matrix whose maximum value is 1.0?

Because, with idf(term_i) = log(number of docs / number of docs containing term_i), shouldn't idf, and subsequently tf-idf, be > 1.0 in many cases?

For example, take documents containing the word 'absinthe'. Say our term frequency (tf) is 1 and only 1 of 1000 total documents contains 'absinthe'. Then idf = log(1000 / 1) ≈ 6.9 (natural log), so tf-idf = 1 × 6.9 ≈ 6.9, which is well above 1.0, no?

But when I use scikit-learn's TfidfVectorizer, the max value I get seems to be 1. Is it normalized?

SpicyClubSauce asked Dec 27 '25 18:12



1 Answer

By default, the tf-idf rows are L2-normalized, so each row has unit Euclidean length and no single entry can exceed 1.0. Here are the critical lines in the source code:

    if self.norm:
        X = normalize(X, norm=self.norm, copy=False)

normalize() comes from the sklearn.preprocessing module, and its documentation states that it normalizes rows (axis=1) by default. See the normalize() docs for details.
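To see why row normalization caps the values at 1.0, here is a minimal standard-library sketch that mimics the default behavior by hand. It is not scikit-learn's actual implementation: the toy corpus and the helper names (tfidf_row, l2_normalize) are made up for illustration, and the idf formula is the smoothed variant that the scikit-learn docs describe for smooth_idf=True.

```python
import math

# Toy corpus (hypothetical); each document is a list of tokens.
docs = [
    ["absinthe", "is", "green"],
    ["tea", "is", "green"],
    ["tea", "is", "hot"],
]

vocab = sorted({tok for doc in docs for tok in doc})
n_docs = len(docs)

# Document frequency: number of documents containing each term.
df = {t: sum(t in doc for doc in docs) for t in vocab}

# Smoothed idf, as described in the scikit-learn docs for smooth_idf=True:
# idf(t) = ln((1 + n) / (1 + df(t))) + 1
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}


def tfidf_row(doc):
    """Raw (unnormalized) tf-idf weights for one document."""
    return [doc.count(t) * idf[t] for t in vocab]


def l2_normalize(row):
    """Scale a row to unit Euclidean length; leave all-zero rows alone."""
    length = math.sqrt(sum(v * v for v in row))
    return [v / length for v in row] if length else row


raw = [tfidf_row(doc) for doc in docs]
normalized = [l2_normalize(row) for row in raw]

print(max(max(row) for row in raw))         # greater than 1.0
print(max(max(row) for row in normalized))  # never greater than 1.0
```

After normalization the squared entries of each row sum to 1, so an entry reaches exactly 1.0 only when a document has a single nonzero term. If you want the unnormalized weights from scikit-learn itself, TfidfVectorizer accepts norm=None, which skips the normalize() call shown above.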

rabbit answered Dec 30 '25 06:12
