I couldn't find the answer to this online, but is the result of TfidfVectorizer.fit_transform a matrix whose maximum value is 1.0?
Because, with

idf(term_i) = log(total number of documents / number of documents containing term_i),

shouldn't idf, and consequently tf-idf, be > 1.0 in many cases?
For example, take documents containing the word 'absinthe'. Say our term frequency (tf) is 1, and idf = log(1000 total documents / 1 document containing 'absinthe') = log(1000) = 3 (base 10), so tf-idf = 1 * 3 = 3, which is well above 1, no?
But in my case, using scikit-learn's TfidfVectorizer, the maximum value I get seems to be 1. Is the output normalized?
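Here is a minimal sketch of what I mean (toy corpus, default settings):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat", "the dog sat", "absinthe"]
X = TfidfVectorizer().fit_transform(docs)
print(X.max())  # always comes out <= 1.0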
By default, the tf-idf rows are L2-normalized. Here are the critical lines in the source code:
if self.norm:
    X = normalize(X, norm=self.norm, copy=False)
normalize() comes from the sklearn.preprocessing module; its documentation states that it normalizes the rows (samples) by default. Here is the link to the normalize() docs.
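To see the effect, here is a quick sketch with an assumed toy corpus: with the default norm='l2' every row is scaled to unit Euclidean length, so no entry can exceed 1.0, while passing norm=None keeps the raw tf-idf weights, which can be well above 1.

from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

docs = ["absinthe", "the cat sat on the mat", "the dog sat"]

X_l2 = TfidfVectorizer(norm='l2').fit_transform(docs)   # default behaviour
X_raw = TfidfVectorizer(norm=None).fit_transform(docs)  # normalization disabled

print(X_l2.max())                              # <= 1.0
print(np.linalg.norm(X_l2.toarray(), axis=1))  # every row has length ~1.0
print(X_raw.max())                             # can be greater than 1.0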