
Fastest way to count all results in Lucene (java)

What is the fastest way to count all results for a given Query in Lucene?

  1. TopDocs.totalHits
  2. implement and manage a Filter, using QueryFilter
  3. implement a custom 'counting' Collector. This simply increments a count in the collect(int doc) method and returns true from the acceptsDocsOutOfOrder() method. All other methods are no-ops.

Since 1. will do scoring on all docs, and 2. could take an upfront hit from loading the FieldCache, I assume the answer is 3 (a rough sketch of what I mean is below). It just seems odd that Lucene doesn't provide such a collector out of the box.
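A minimal sketch of what I mean, written against the Lucene 3.x Collector API current at the time (untested):

```java
import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.Scorer;

// Minimal counting collector: only collect() does any work.
public class CountingCollector extends Collector {
    private int count = 0;

    @Override
    public void setScorer(Scorer scorer) throws IOException {
        // no-op: scores are never used
    }

    @Override
    public void collect(int doc) throws IOException {
        count++; // just count the hit
    }

    @Override
    public void setNextReader(IndexReader reader, int docBase) throws IOException {
        // no-op: we don't care which segment the doc came from
    }

    @Override
    public boolean acceptsDocsOutOfOrder() {
        return true; // order is irrelevant when counting
    }

    public int getCount() {
        return count;
    }
}
```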

asked Feb 07 '11 by npellow

2 Answers

The code should be here now: http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/core/src/java/org/apache/lucene/search/TotalHitCountCollector.java
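Usage is roughly like this (a minimal sketch, assuming an IndexSearcher and a Query are already set up elsewhere):

```java
import org.apache.lucene.search.TotalHitCountCollector;

// Count every match without scoring, sorting, or keeping doc ids.
TotalHitCountCollector collector = new TotalHitCountCollector();
searcher.search(query, collector);
int totalHits = collector.getTotalHits();
```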

answered by Robert Muir

You are right that #3 will be quicker, but I don't think it's because of scoring. There is a much faster way; skip to the bottom if you don't care about the reasoning behind this.

The performance loss of #1 comes from the fact that the TopDocs collector will keep the docs in a priority queue, which means that you will lose some time sorting them by score. (You will also eat up some memory, but since you're storing just a heap of int+float pairs, it's probably pretty minimal.)

As to why Lucene doesn't provide this out of the box: you generally don't want to find all results. That's why when you search, you say to only find the top n results. There are strong theoretical reasons for this. Even Google says "Showing 25 of about n results."

So my advice to you is the following: if you have a reasonable number of results, then using TopDocs.totalHits won't be too bad performance-wise. If the totalHits route gives you problems, I don't think that a custom collector will be much better. (Collecting the top k docs runs in roughly n log k time, and the custom collector is linear. Depending on your setup, the log k factor may be relevant, or it may not.)
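For concreteness, the totalHits route looks roughly like this (Lucene 3.x API; searcher and query assumed to exist):

```java
import org.apache.lucene.search.TopDocs;

// Keep only the top 10 docs in the priority queue;
// totalHits still counts every matching document.
TopDocs topDocs = searcher.search(query, 10);
int totalHits = topDocs.totalHits;
```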

So, if you absolutely need this functionality and TopDocs.totalHits is too slow, I would recommend looking at the document frequency of the search terms. You could assume the terms occur independently (so p(A and B) = p(A) * p(B)) and make a pretty good guess from there. It will be very fast, because it's just a constant-time lookup for each term.
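A minimal sketch of that estimate (estimateConjunctionHits is a hypothetical helper name, not a Lucene API; assumes the Lucene 3.x IndexReader):

```java
import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;

// Estimate hits for an AND query, assuming the terms are independent:
// p(A and B) = p(A) * p(B), where p(term) = docFreq(term) / numDocs.
static long estimateConjunctionHits(IndexReader reader, Term... terms)
        throws IOException {
    int n = reader.numDocs();
    if (n == 0) {
        return 0;
    }
    double p = 1.0;
    for (Term t : terms) {
        p *= (double) reader.docFreq(t) / n; // constant-time lookup per term
    }
    return Math.round(p * n);
}
```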

answered by Xodarap