
Is CountVectorizer the same as TfidfVectorizer with use_idf=False?

As the title states: is CountVectorizer the same as TfidfVectorizer with use_idf=False? If not, why not?

So does this also mean that adding the TfidfTransformer here is redundant?

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Step 1: raw term counts.
vect = CountVectorizer(min_df=1)
tweets_vector = vect.fit_transform(corpus)

# Step 2: tf weighting without idf (note: TfidfTransformer still
# L2-normalizes each row by default, so this is not a no-op).
tf_transformer = TfidfTransformer(use_idf=False).fit(tweets_vector)
tweets_vector_tf = tf_transformer.transform(tweets_vector)
asked Mar 18 '14 by Olivier_s_j


People also ask

What is the difference between TfidfVectorizer and CountVectorizer?

TF-IDF improves on raw counts because it not only reflects how frequently words occur in the corpus but also how informative each word is. Words that carry little information can then be dropped, which reduces the input dimensionality and keeps the model simpler.
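As a minimal sketch of that idea (the toy corpus and the use of max_df here are illustrative; get_feature_names_out assumes scikit-learn >= 1.0):

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["the cat sat", "the dog sat", "the cat ran"]

vect = TfidfVectorizer()
X = vect.fit_transform(corpus)
print(vect.get_feature_names_out())
print(X.A.round(2))
# Within each row, "the" (present in every document) gets the smallest
# weight, while rare terms like "dog" and "ran" get the largest.

# One built-in pruning knob: max_df drops terms that occur in too large
# a fraction of the documents.
pruned = TfidfVectorizer(max_df=0.9).fit(corpus)
print(pruned.get_feature_names_out())  # "the" is gone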

What is the difference between TfidfVectorizer and TfidfTransformer?

TfidfTransformer and TfidfVectorizer aim to do the same thing: convert a collection of raw documents to a matrix of TF-IDF features. The difference is that with TfidfTransformer you compute the word counts yourself (typically with CountVectorizer) and then have it generate the idf values and tf-idf scores, whereas TfidfVectorizer does all of that in one step.
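A small sketch (toy corpus assumed) showing that, with matching defaults, the two routes produce identical matrices:

from sklearn.feature_extraction.text import (CountVectorizer,
                                             TfidfTransformer,
                                             TfidfVectorizer)
import numpy as np

corpus = ["foo bar baz", "foo bar quux"]

# Two-step route: raw counts first, then tf-idf weighting.
counts = CountVectorizer().fit_transform(corpus)
two_step = TfidfTransformer().fit_transform(counts)

# One-step route: TfidfVectorizer does both at once.
one_step = TfidfVectorizer().fit_transform(corpus)

print(np.allclose(two_step.A, one_step.A))  # True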

What are TF-IDF and CountVectorizer?

The main difference between the two implementations is that TfidfVectorizer computes both the term frequencies and the inverse document frequencies for you, while with TfidfTransformer you first use scikit-learn's CountVectorizer to compute the term frequencies.

Why is TF-IDF a Vectorizer?

TF-IDF is one of the most popular text vectorizers; the calculation is simple and easy to understand. It gives rare terms high weight and common terms low weight.
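For example, inspecting the fitted idf_ values on a toy corpus (assumed here) shows the weighting directly, with smooth_idf=True, the scikit-learn default:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["foo bar", "foo baz", "foo quux"]
vect = TfidfVectorizer().fit(corpus)

# "foo" occurs in all three documents, so it gets the minimum idf (1.0);
# the terms occurring only once get a larger idf, hence larger weights.
for term, idx in sorted(vect.vocabulary_.items()):
    print(term, round(vect.idf_[idx], 3))
# bar 1.693
# baz 1.693
# foo 1.0
# quux 1.693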


2 Answers

No, they're not the same. TfidfVectorizer normalizes its results, i.e. each vector in its output has norm 1:

>>> CountVectorizer().fit_transform(["foo bar baz", "foo bar quux"]).A
array([[1, 1, 1, 0],
       [1, 0, 1, 1]])
>>> TfidfVectorizer(use_idf=False).fit_transform(["foo bar baz", "foo bar quux"]).A
array([[ 0.57735027,  0.57735027,  0.57735027,  0.        ],
       [ 0.57735027,  0.        ,  0.57735027,  0.57735027]])

This is done so that dot products between rows are cosine similarities. TfidfVectorizer can also use logarithmically discounted frequencies when given the option sublinear_tf=True.
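For instance, a quick sketch of what sublinear_tf does (toy one-document corpus; idf and normalization are switched off so only the tf scaling is visible — four occurrences of "foo" become 1 + log(4)):

>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> corpus = ["foo foo foo foo bar"]
>>> TfidfVectorizer(use_idf=False, norm=None).fit_transform(corpus).A
array([[1., 4.]])
>>> TfidfVectorizer(use_idf=False, norm=None, sublinear_tf=True).fit_transform(corpus).A
array([[1.        , 2.38629436]])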

To make TfidfVectorizer behave like CountVectorizer, give it the constructor options use_idf=False, norm=None (note the parameter is named norm, not normalize).
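A quick check of that claim on the same toy corpus; the only remaining difference is the output dtype (float instead of int):

>>> import numpy as np
>>> from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
>>> corpus = ["foo bar baz", "foo bar quux"]
>>> counts = CountVectorizer().fit_transform(corpus)
>>> tf_only = TfidfVectorizer(use_idf=False, norm=None).fit_transform(corpus)
>>> np.array_equal(counts.A, tf_only.A)
True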

answered by Fred Foo


As larsmans said, TfidfVectorizer(use_idf=False, norm=None, ...) is supposed to behave the same as CountVectorizer.

In the version current at the time of writing (0.14.1), there's a bug where TfidfVectorizer(binary=True, ...) silently leaves binary=False, which can throw you off during a grid search for the best parameters. (CountVectorizer, in contrast, sets the binary flag correctly.) This appears to be fixed in later (post-0.14.1) versions.
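On a fixed version, a sanity check like this sketch (toy corpus; idf and normalization disabled to keep the values comparable) should show both vectorizers clipping the repeated term to 1:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ["foo foo bar"]

# Both rows should read [1, 1]; on an affected 0.14.1 install the
# TfidfVectorizer row would still show a 2 for the repeated "foo".
print(CountVectorizer(binary=True).fit_transform(corpus).A)
print(TfidfVectorizer(binary=True, use_idf=False, norm=None)
      .fit_transform(corpus).A)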

answered by Rolf H Nelson