
TFIDF calculating confusion

Tags:

python

text-processing

data-mining

information-retrieval

tf-idf

I found the following code on the internet for calculating TFIDF:

https://github.com/timtrueman/tf-idf/blob/master/tf-idf.py

I added "1+" to the denominator in the function idf(word, documentList) so I won't get a division-by-zero error:

return math.log(len(documentList) / (1 + float(numDocsContaining(word,documentList))))

But I am confused about two things:

1. I get negative values in some cases. Is this correct?
2. I am confused by lines 62, 63 and 64.

Code:

    documentNumber = 0
    for word in documentList[documentNumber].split(None):
        words[word] = tfidf(word, documentList[documentNumber], documentList)

Should TFIDF be calculated on the first document only?


asked May 20 '13 11:05

badc0re

2 Answers

No. Tf-idf is the product of tf and idf, both of which are non-negative under the standard definition, so it can never be negative. This code seems to implement the erroneous definition of tf-idf that was on Wikipedia for years (it has since been fixed).
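As a rough illustration of why the standard definition cannot go negative, here is a minimal sketch. It assumes the common formula idf = log(N / df); the function names, whitespace tokenization and sample documents are made up for this example and are not taken from the linked script.

```python
import math

def tf(word, document):
    """Term frequency: occurrences of word divided by the document length."""
    tokens = document.split()
    return tokens.count(word) / len(tokens)

def idf(word, documents):
    """Standard idf, log(N / df). Undefined for a word that occurs nowhere."""
    df = sum(1 for doc in documents if word in doc.split())
    if df == 0:
        raise ValueError("idf is undefined for a word absent from every document")
    return math.log(len(documents) / df)

def tfidf(word, document, documents):
    return tf(word, document) * idf(word, documents)

# Hypothetical toy corpus, used only for illustration.
docs = ["the cat sat", "the dog barked", "the cat purred"]

# Score every word of every document, not just the first one.
scores = [tfidf(word, doc, docs) for doc in docs for word in doc.split()]
```

Since df can never exceed N, the ratio N / df is at least 1, so the logarithm and every score in `scores` is zero or positive.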


answered Oct 08 '22 10:10

Fred Foo

If the word in question occurs in every document in the collection, your 1+ change will produce a negative value: 0 < x / (1 + x) < 1 holds for all x > 0, so the logarithm is negative.
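A quick check of this with the formula from the question, written as a standalone sketch (the function name and the document counts are made up for illustration):

```python
import math

def idf_plus_one(num_docs, docs_containing):
    # The formula from the question: 1 is added to the denominator only.
    return math.log(num_docs / (1 + float(docs_containing)))

# A word that occurs in all 4 documents: log(4 / 5), which is below zero.
in_every_doc = idf_plus_one(4, 4)

# A word that occurs in no document: log(4 / 1), which is positive.
in_no_doc = idf_plus_one(4, 0)
```

Here `in_every_doc` is negative while `in_no_doc` stays positive, so only words present in every document are affected.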

In my opinion, the correct IDF for a nonexistent word is infinite or undefined. However, if you add 1 to both the numerator and the denominator, a nonexistent word gets an IDF slightly higher than that of any existing word, and a word that occurs in every document gets an IDF of zero. Both cases will probably work well with your code.
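One way to sketch this variant, with 1 added to both sides of the fraction (the function name and the counts are hypothetical, not part of the original script):

```python
import math

def smoothed_idf(num_docs, docs_containing):
    # 1 is added to both the numerator and the denominator,
    # so the ratio never drops below 1 and the log never goes negative.
    return math.log((1 + num_docs) / (1 + docs_containing))

nonexistent = smoothed_idf(4, 0)   # word found in no document
rare = smoothed_idf(4, 1)          # word found in one document
everywhere = smoothed_idf(4, 4)    # word found in every document
```

With these counts, `nonexistent` is larger than `rare`, and `everywhere` is exactly zero.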

answered Oct 08 '22 09:10

Omnidux
