
Inverse Document Frequency Formula

I'm having trouble manually calculating tf-idf values. scikit-learn in Python keeps producing different values than I'd expect.

I keep reading that

idf(term) = log(# of docs / # of docs with term)

If so, won't you get a divide-by-zero error if there are no docs with the term?

To solve that problem, I read that you do

log(# of docs / (# of docs with term + 1))

But then if the term is in every document, you get log(n / (n + 1)), which is negative, and that doesn't really make sense to me.
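
To make the problem concrete, here is a minimal sketch of both formulas, assuming a made-up corpus of 4 documents. The function names are illustrative only and are not part of scikit-learn.

```python
import math


def idf_plain(n_docs, n_docs_with_term):
    # log(# of docs / # of docs with term); breaks when no doc has the term
    return math.log(n_docs / n_docs_with_term)


def idf_plus_one_denominator(n_docs, n_docs_with_term):
    # log(# of docs / (# of docs with term + 1)); avoids the division by zero
    return math.log(n_docs / (n_docs_with_term + 1))


n = 4  # hypothetical corpus size

try:
    idf_plain(n, 0)
except ZeroDivisionError:
    print("plain idf: division by zero when no document contains the term")

# Term appears in every document: log(4 / 5) is below zero.
print(idf_plus_one_denominator(n, n))
```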

What am I not getting?

George B asked Mar 24 '26 03:03



1 Answer

The trick you describe is actually called Laplace smoothing (also known as additive or add-one smoothing), and it is supposed to add the same summand to the other part of the fraction as well: the numerator in your case, or the denominator in the original formulation.

In other words, you should add 1 to the total number of docs:

log((# of docs + 1) / (# of docs with term + 1))

Btw, it is often better to use a smaller summand, especially with a small corpus:

log((# of docs + a) / (# of docs with term + a)),

where a = 0.001 or something like that.
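
The smoothed formula above can be sketched as follows. This is only an illustration under stated assumptions: the function name `idf_smoothed`, its parameter `a`, and the corpus size of 4 are hypothetical and not taken from any library.

```python
import math


def idf_smoothed(n_docs, n_docs_with_term, a=1.0):
    # log((# of docs + a) / (# of docs with term + a))
    # The same summand a is added to both numerator and denominator.
    return math.log((n_docs + a) / (n_docs_with_term + a))


n = 4  # hypothetical corpus size

# Term in every document: the ratio is exactly 1, so the idf is 0, not negative.
print(idf_smoothed(n, n))

# Term in no document: no division by zero, and the idf stays finite.
print(idf_smoothed(n, 0))

# A smaller summand keeps the result closer to the unsmoothed formula,
# so an unseen term gets a larger idf than with a = 1.
print(idf_smoothed(n, 0, a=0.001))
```

With both parts of the fraction smoothed, the value can never drop below zero, because the number of docs with the term never exceeds the total number of docs.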

Nikita Astrakhantsev answered Mar 26 '26 16:03



