
Why does Lucene use maxDoc instead of numDocs to compute term idf?

I found this in the javadoc for the public float idf(Term term, Searcher searcher) method of Lucene's Similarity class:

Note that Searcher.maxDoc() is used instead of IndexReader#numDocs() because also Searcher.docFreq(Term) is used, and when the latter is inaccurate, so is Searcher.maxDoc(), and in the same direction. In addition, Searcher.maxDoc() is more efficient to compute.

This does not quite make sense to me. Does this have something to do with Document deletion in an IndexReader?

asked May 31 '13 06:05 by Yuhao


1 Answer

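Yes, exactly right. Whenever a document is deleted (or updated, since an update in Lucene is just a delete followed by an add), it is only marked as deleted and remains in the index until the segments containing it are merged, often by an index optimize. Having been deleted, it won't be returned by searches, but its terms are still counted by docFreq and so still influence idf scoring. maxDoc also counts deleted documents, while numDocs does not, so pairing docFreq with maxDoc keeps both counts off in the same direction.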
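To make the "same direction" point concrete, here is a minimal sketch in plain Java with no Lucene dependency. It assumes the idf formula log(docCount / (docFreq + 1)) + 1, which is how I recall DefaultSimilarity computing it in Lucene 3.x; the class name IdfSketch and all of the counts are hypothetical, chosen only for illustration.

```java
// Sketch only: the formula is modeled on Lucene 3.x DefaultSimilarity as an
// assumption, and every count below is a made-up number.
final class IdfSketch {

    /** idf computed from a document frequency and a total document count. */
    static float idf(int docFreq, int docCount) {
        return (float) (Math.log(docCount / (double) (docFreq + 1)) + 1.0);
    }

    public static void main(String[] args) {
        // Hypothetical index: 1000 documents were added, 100 contain the term.
        // Then 500 documents were deleted, 50 of them containing the term,
        // and the affected segments have not been merged yet.
        int maxDoc = 1000;     // still counts deleted documents
        int numDocs = 500;     // counts live documents only
        int docFreq = 100;     // still counts deleted documents
        int liveDocFreq = 50;  // what docFreq would become after a merge

        float afterMerge = idf(liveDocFreq, numDocs); // value once deletes are purged
        float consistent = idf(docFreq, maxDoc);      // both counts include deletes
        float mixed = idf(docFreq, numDocs);          // stale docFreq, live doc count

        System.out.println("after merge:       " + afterMerge);
        System.out.println("docFreq / maxDoc:  " + consistent);
        System.out.println("docFreq / numDocs: " + mixed);
    }
}
```

With these numbers, the docFreq/maxDoc pairing stays close to the post-merge value because both counts are inflated by deleted documents, whereas mixing the stale docFreq with numDocs makes the term look more common than it is and pulls its idf down. How close the pairing stays in practice depends on whether deletions hit documents containing the term at roughly the same rate as the rest of the index.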

The Lucene FAQ has some information related to this, particularly in the last paragraph of its answer on deletion, and in the entry that addresses maxDoc specifically.

answered Oct 13 '22 23:10 by femtoRgon