Why is log used when calculating term frequency weight and IDF, inverse document frequency?

The formula for IDF is log(N / df_t) instead of just N / df_t,

where N = total number of documents in the collection, and df_t = document frequency of term t (the number of documents that contain t).

Log is said to be used because it “dampens” the effect of IDF. What does this mean?

Also, why do we use log-frequency weighting for term frequency, as seen here:

[image not preserved]

stevetronix asked Nov 21 '14 18:11



1 Answer

Debasis's answer is correct. I am not sure why he got downvoted.

Here is the intuition: if the term frequency for the word 'computer' is 10 in doc1 and 20 in doc2, we can say that doc2 is more relevant than doc1 for the word 'computer'.

However, if the term frequency of the same word, 'computer', is 1 million in doc1 and 2 million in doc2, there is not much difference in relevance anymore, because both documents contain a very high count for the term 'computer'.

As in Debasis's answer, taking the log dampens the importance of a term that has a high frequency. For example, using log base 2, a count of 1 million is reduced to about 19.9!
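To make the dampening concrete, here is a minimal sketch that compares raw counts with their log-scaled versions, using the counts from the examples above. The helper name `log2_ratio` and the collection size `n_docs` are made up for illustration; the second loop applies the same idea to the N / df_t ratio from the question, using base 10 as an arbitrary choice.

```python
import math

def log2_ratio(tf_a, tf_b):
    """Ratio of the log2-scaled counts (assumes both counts are > 1)."""
    return math.log2(tf_b) / math.log2(tf_a)

# Term frequency: the raw ratio is always 2, but the log-scaled ratio
# shrinks toward 1 as the counts grow.
for tf_a, tf_b in [(10, 20), (1_000_000, 2_000_000)]:
    print(f"tf {tf_a} vs {tf_b}: raw ratio = {tf_b / tf_a:.2f}, "
          f"log2 ratio = {log2_ratio(tf_a, tf_b):.2f}")

# The figure quoted above: log2 of 1 million.
print(round(math.log2(1_000_000), 1))  # 19.9

# IDF: N / df_t spans six orders of magnitude here, log10(N / df_t) only 0..6.
n_docs = 1_000_000  # hypothetical collection size
for df in (1, 1_000, 1_000_000):
    print(f"df = {df}: N/df = {n_docs / df:.0f}, "
          f"log10(N/df) = {math.log10(n_docs / df):.1f}")
```

With raw values, a term that appears in a single document would outweigh a term that appears in 1,000 documents by a factor of 1,000; after the log, the factor is 2.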

We also add 1 to log(tf) because when tf is equal to 1, log(1) is zero. By adding one, we distinguish between tf = 0 (weight 0) and tf = 1 (weight 1).
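A small sketch of that weighting, assuming the usual convention that the weight is 0 when the term does not occur at all. The function name `log_tf_weight` is hypothetical, and base 10 is an assumption; a different base only rescales the log part.

```python
import math

def log_tf_weight(tf):
    """Log-frequency weight: 1 + log10(tf) if tf > 0, else 0."""
    if tf <= 0:
        return 0.0  # term absent: no weight at all
    return 1.0 + math.log10(tf)

for tf in (0, 1, 10, 1000):
    print(f"tf = {tf}: weight = {log_tf_weight(tf):.2f}")
```

Without the added 1, tf = 1 would get weight log(1) = 0, the same as a term that never appears in the document.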

Hope this helps!

suthee answered Oct 12 '22 13:10