
Is CountVectorizer the same as TfidfVectorizer with use_idf=False?

As the title states: is CountVectorizer the same as TfidfVectorizer with use_idf=False? If not, why not?

So does this also mean that adding the TfidfTransformer here is redundant?

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Step 1: raw term counts.
vect = CountVectorizer(min_df=1)
tweets_vector = vect.fit_transform(corpus)

# Step 2: tf weighting without idf (note: TfidfTransformer still
# L2-normalizes each row by default, so this is not a no-op).
tf_transformer = TfidfTransformer(use_idf=False).fit(tweets_vector)
tweets_vector_tf = tf_transformer.transform(tweets_vector)
asked Mar 18 '14 by Olivier_s_j


People also ask

What is the difference between TfidfVectorizer and CountVectorizer?

TF-IDF improves on raw counts because it not only reflects how frequently words occur in the corpus but also how informative each word is. Words that carry little information can then be dropped, which reduces the input dimensionality and keeps the model simpler.
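As a minimal sketch of that idea (the toy corpus and the use of max_df here are illustrative; get_feature_names_out assumes scikit-learn >= 1.0):

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["the cat sat", "the dog sat", "the cat ran"]

vect = TfidfVectorizer()
X = vect.fit_transform(corpus)
print(vect.get_feature_names_out())
print(X.A.round(2))
# Within each row, "the" (present in every document) gets the smallest
# weight, while rare terms like "dog" and "ran" get the largest.

# One built-in pruning knob: max_df drops terms that occur in too large
# a fraction of the documents.
pruned = TfidfVectorizer(max_df=0.9).fit(corpus)
print(pruned.get_feature_names_out())  # "the" is gone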

What is the difference between TfidfVectorizer and TfidfTransformer?

TfidfTransformer and TfidfVectorizer aim to do the same thing: convert a collection of raw documents to a matrix of TF-IDF features. The difference is that with TfidfTransformer you compute the word counts yourself (typically with CountVectorizer) and then have it generate the idf values and tf-idf scores, whereas TfidfVectorizer does all of that in one step.
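A small sketch (toy corpus assumed) showing that, with matching defaults, the two routes produce identical matrices:

from sklearn.feature_extraction.text import (CountVectorizer,
                                             TfidfTransformer,
                                             TfidfVectorizer)
import numpy as np

corpus = ["foo bar baz", "foo bar quux"]

# Two-step route: raw counts first, then tf-idf weighting.
counts = CountVectorizer().fit_transform(corpus)
two_step = TfidfTransformer().fit_transform(counts)

# One-step route: TfidfVectorizer does both at once.
one_step = TfidfVectorizer().fit_transform(corpus)

print(np.allclose(two_step.A, one_step.A))  # True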

What are TF-IDF and CountVectorizer?

The main difference between the two implementations is that TfidfVectorizer computes both the term frequencies and the inverse document frequencies for you, while with TfidfTransformer you first use scikit-learn's CountVectorizer to compute the term frequencies.

Why is TF-IDF a Vectorizer?

TF-IDF is one of the most popular text vectorizers; the calculation is simple and easy to understand. It gives rare terms high weight and common terms low weight.
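For example, inspecting the fitted idf_ values on a toy corpus (assumed here) shows the weighting directly, with smooth_idf=True, the scikit-learn default:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["foo bar", "foo baz", "foo quux"]
vect = TfidfVectorizer().fit(corpus)

# "foo" occurs in all three documents, so it gets the minimum idf (1.0);
# the terms occurring only once get a larger idf, hence larger weights.
for term, idx in sorted(vect.vocabulary_.items()):
    print(term, round(vect.idf_[idx], 3))
# bar 1.693
# baz 1.693
# foo 1.0
# quux 1.693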


2 Answers

No, they're not the same. TfidfVectorizer normalizes its results, i.e. each vector in its output has norm 1:

>>> CountVectorizer().fit_transform(["foo bar baz", "foo bar quux"]).A
array([[1, 1, 1, 0],
       [1, 0, 1, 1]])
>>> TfidfVectorizer(use_idf=False).fit_transform(["foo bar baz", "foo bar quux"]).A
array([[ 0.57735027,  0.57735027,  0.57735027,  0.        ],
       [ 0.57735027,  0.        ,  0.57735027,  0.57735027]])

This is done so that dot products between rows are cosine similarities. TfidfVectorizer can also use logarithmically discounted frequencies when given the option sublinear_tf=True.
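For instance, a quick sketch of what sublinear_tf does (toy one-document corpus; idf and normalization are switched off so only the tf scaling is visible — four occurrences of "foo" become 1 + log(4)):

>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> corpus = ["foo foo foo foo bar"]
>>> TfidfVectorizer(use_idf=False, norm=None).fit_transform(corpus).A
array([[1., 4.]])
>>> TfidfVectorizer(use_idf=False, norm=None, sublinear_tf=True).fit_transform(corpus).A
array([[1.        , 2.38629436]])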

To make TfidfVectorizer behave like CountVectorizer, give it the constructor options use_idf=False, norm=None (note the parameter is named norm, not normalize).
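A quick check of that claim on the same toy corpus; the only remaining difference is the output dtype (float instead of int):

>>> import numpy as np
>>> from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
>>> corpus = ["foo bar baz", "foo bar quux"]
>>> counts = CountVectorizer().fit_transform(corpus)
>>> tf_only = TfidfVectorizer(use_idf=False, norm=None).fit_transform(corpus)
>>> np.array_equal(counts.A, tf_only.A)
True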

answered by Fred Foo


As larsmans said, TfidfVectorizer(use_idf=False, norm=None, ...) is supposed to behave the same as CountVectorizer.

In the version current at the time of writing (0.14.1), there's a bug where TfidfVectorizer(binary=True, ...) silently leaves binary=False, which can throw you off during a grid search for the best parameters. (CountVectorizer, in contrast, sets the binary flag correctly.) This appears to be fixed in later (post-0.14.1) versions.
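On a fixed version, a sanity check like this sketch (toy corpus; idf and normalization disabled to keep the values comparable) should show both vectorizers clipping the repeated term to 1:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ["foo foo bar"]

# Both rows should read [1, 1]; on an affected 0.14.1 install the
# TfidfVectorizer row would still show a 2 for the repeated "foo".
print(CountVectorizer(binary=True).fit_transform(corpus).A)
print(TfidfVectorizer(binary=True, use_idf=False, norm=None)
      .fit_transform(corpus).A)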

answered by Rolf H Nelson