
TFIDF calculating confusion

I found the following code on the internet for calculating TFIDF:

https://github.com/timtrueman/tf-idf/blob/master/tf-idf.py

I added "1+" to the denominator in the function def idf(word, documentList) so that I won't get a division-by-zero error:

return math.log(len(documentList) / (1 + float(numDocsContaining(word,documentList))))

But I am confused about two things:

  1. I get negative values in some cases, is this correct?
  2. I am confused by lines 62, 63 and 64.

Code:

    documentNumber = 0
    for word in documentList[documentNumber].split(None):
        words[word] = tfidf(word, documentList[documentNumber], documentList)

Should TFIDF be calculated on the first document only?

badc0re asked May 20 '13 11:05



2 Answers

  1. No. Tf-idf is tf, a non-negative value, times idf, a non-negative value, so it can never be negative. This code seems to implement the erroneous definition of tf-idf that was on Wikipedia for years (it has since been fixed).
Fred Foo answered Oct 08 '22 10:10


If the word in question is contained in every document in the collection, your 1+ change will result in a negative value. With x documents that all contain the word, the ratio inside the logarithm is x / (1 + x), and 0 < x / (1 + x) < 1 holds for all x > 0, so the logarithm is negative.
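To make this concrete, here is a minimal sketch in Python 3 of an IDF with 1 added to the denominator only. The helper num_docs_containing and the three sample documents are assumptions made up for illustration; they are not taken from the linked script, which may count documents differently.

```python
import math

def num_docs_containing(word, documents):
    """Count the documents whose whitespace-split tokens include word (hypothetical helper)."""
    return sum(1 for doc in documents if word in doc.split())

def idf_plus_one_denominator(word, documents):
    """IDF with 1 added to the denominator only, as described in the question."""
    return math.log(len(documents) / (1 + num_docs_containing(word, documents)))

# Made-up corpus: "the" occurs in all three documents, "cat" in only one.
documents = ["the cat sat", "the dog ran", "the bird flew"]

idf_the = idf_plus_one_denominator("the", documents)  # log(3 / 4), which is negative
idf_cat = idf_plus_one_denominator("cat", documents)  # log(3 / 2), which is positive
```

Since tf is never negative, any negative tf-idf from this variant has to come from a word that occurs in every document.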

In my opinion, the correct IDF for a nonexistent word is infinite or undefined. However, if you add 1 to both the denominator and the numerator, a nonexistent word will have an IDF slightly higher than that of any existing word, and words that exist in every document will have an IDF of zero. Both cases will probably work well with your code.
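A minimal sketch of that suggestion in Python 3, assuming whitespace tokenization and a made-up three-document corpus (the name smoothed_idf is hypothetical and does not appear in the linked script):

```python
import math

def smoothed_idf(word, documents):
    """IDF with 1 added to both the numerator and the denominator."""
    n_containing = sum(1 for doc in documents if word in doc.split())
    return math.log((1 + len(documents)) / (1 + n_containing))

# Made-up corpus: "the" occurs everywhere, "cat" once, "zebra" nowhere.
documents = ["the cat sat", "the dog ran", "the bird flew"]

idf_missing = smoothed_idf("zebra", documents)   # log(4 / 1): highest value
idf_rare = smoothed_idf("cat", documents)        # log(4 / 2): positive
idf_everywhere = smoothed_idf("the", documents)  # log(4 / 4): zero
```

With this variant the ratio inside the logarithm is never below 1, so the IDF cannot go negative, and a word that is missing from the corpus no longer causes a division by zero.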

Omnidux answered Oct 08 '22 09:10