Why is log used when calculating term frequency weight and IDF, inverse document frequency?

1 Answer

Debasis's answer is correct. I am not sure why he got downvoted.

Here is the intuition: if the term frequency of the word 'computer' is 10 in doc1 and 20 in doc2, we can say that doc2 is more relevant than doc1 for the word 'computer'.

However, if the term frequency of the same word, 'computer', is 1 million in doc1 and 2 million in doc2, there is not much difference in relevance anymore, because both documents already contain a very high count of the term 'computer'.

As Debasis's answer says, the log is applied to dampen the importance of a term that has a high frequency. For example, using log base 2, a count of 1 million is reduced to about 19.9!
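To make the dampening concrete, here is a small Python sketch using the two scenarios above. The helper name `log2_count` is made up for illustration, and base 2 is just the base used in the example; any base gives the same picture up to a constant factor.

```python
import math

def log2_count(tf):
    """Return log base 2 of a positive raw term count (hypothetical helper)."""
    return math.log2(tf)

# (doc1, doc2) raw counts of 'computer' from the two scenarios in this answer.
pairs = [(10, 20), (1_000_000, 2_000_000)]

for tf1, tf2 in pairs:
    raw_gap = tf2 - tf1
    log_gap = log2_count(tf2) - log2_count(tf1)
    print(f"tf={tf1} vs tf={tf2}: raw gap={raw_gap}, log2 gap={log_gap:.2f}")
```

The raw gap grows from 10 to 1,000,000, but after the log both pairs are separated by the same amount, because doubling a count always adds exactly 1 in log base 2. So the second document is treated as only slightly more relevant, not a million counts more relevant.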

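We also add 1 to log(tf) because when tf is equal to 1, log(1) is zero. Without the added one, a term that occurs once would get the same weight as a term that does not occur at all; by adding one, we distinguish between tf=0 and tf=1.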
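A minimal sketch of that weighting in Python, assuming the common convention that a term with tf=0 gets weight 0 (the log is undefined there). The function name `tf_weight` and the default base of 2 are my own choices for illustration, not something fixed by the formula.

```python
import math

def tf_weight(tf, base=2):
    """Damped term-frequency weight: 1 + log(tf) for tf > 0, else 0.

    Assumption: absent terms (tf == 0) are given weight 0.
    """
    if tf <= 0:
        return 0.0
    return 1.0 + math.log(tf, base)

for tf in (0, 1, 10, 1_000_000):
    print(f"tf={tf}: weight={tf_weight(tf):.2f}")
```

With this, tf=0 maps to 0 and tf=1 maps to 1, so a single occurrence still counts, while a count of 1 million only reaches a weight of about 20.9.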

Hope this helps!

answered Oct 12 '22 13:10

suthee



Tags:

information-retrieval

tf-idf

stevetronix
