Why does Lucene use maxDoc instead of numDocs to compute term idf?

Question

I found this in the Javadoc for the public float idf(Term term, Searcher searcher) method of Lucene's Similarity class:

Note that Searcher.maxDoc() is used instead of IndexReader#numDocs() because also Searcher.docFreq(Term) is used, and when the latter is inaccurate, so is Searcher.maxDoc(), and in the same direction. In addition, Searcher.maxDoc() is more efficient to compute.

This does not quite make sense to me. Does this have something to do with Document deletion in an IndexReader?

femtoRgon · Accepted Answer

Yes, exactly right. Whenever a document is deleted (or updated, since an update in Lucene is just a delete followed by an add), it is only marked as deleted and remains in the index until its segments are merged, often by an index optimize. It won't be returned by searches, having been deleted, but its terms will still have an influence on idf scoring: docFreq still counts it, and so does maxDoc, so the two numbers stay consistent with each other.
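To make the "same direction" point concrete, here is a minimal sketch with no Lucene dependency. It assumes the classic idf formula, 1 + ln(docCount / (docFreq + 1)), as used by older DefaultSimilarity implementations; the class name IdfSketch, the method classicIdf, and all the numbers are made up for illustration.

```java
// Hypothetical sketch: not Lucene code, just the arithmetic behind the Javadoc note.
final class IdfSketch {

    // Assumed classic formula: 1 + ln(docCount / (docFreq + 1)).
    static float classicIdf(long docFreq, long docCount) {
        return (float) (Math.log(docCount / (double) (docFreq + 1)) + 1.0);
    }

    public static void main(String[] args) {
        long maxDoc = 1000;              // all document slots, deleted ones included
        long deleted = 400;              // marked as deleted, not yet merged away
        long numDocs = maxDoc - deleted; // live documents only
        long docFreq = 700;              // still counts deleted documents containing the term

        // Consistent pair: both numbers include deleted documents.
        System.out.println("idf with maxDoc:  " + classicIdf(docFreq, maxDoc));

        // Mismatched pair: docFreq exceeds numDocs, so the logarithm turns negative.
        System.out.println("idf with numDocs: " + classicIdf(docFreq, numDocs));
    }
}
```

With these made-up numbers the term appears to occur in more documents than exist once numDocs is used, which pushes the idf below 1. Pairing docFreq with maxDoc keeps the ratio sensible because both counts are inflated by the same deleted documents.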

The Lucene FAQ has some information related to this, particularly in the last paragraph of its answer on deletion and in the entry addressing maxDoc specifically.

Tags:

java

search

lucene

Yuhao
