
Fastest way to count all results in Lucene (java)

What is the fastest way to count all results for a given Query in Lucene?

  1. TopDocs.totalHits
  2. implement and manage a Filter, using QueryFilter
  3. implement a custom 'counting' Collector. This simply increments a count in the collect(int doc) method and returns true from the acceptsDocsOutOfOrder() method. All other methods are no-ops.

Since 1. will do scoring on all docs, and 2. could take an upfront hit from loading the FieldCache, I assume the answer is 3 (a rough sketch of what I mean is below). It just seems odd that Lucene doesn't provide such a collector out of the box.
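A minimal sketch of what I mean, written against the Lucene 3.x Collector API current at the time (untested):

```java
import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.Scorer;

// Minimal counting collector: only collect() does any work.
public class CountingCollector extends Collector {
    private int count = 0;

    @Override
    public void setScorer(Scorer scorer) throws IOException {
        // no-op: scores are never used
    }

    @Override
    public void collect(int doc) throws IOException {
        count++; // just count the hit
    }

    @Override
    public void setNextReader(IndexReader reader, int docBase) throws IOException {
        // no-op: we don't care which segment the doc came from
    }

    @Override
    public boolean acceptsDocsOutOfOrder() {
        return true; // order is irrelevant when counting
    }

    public int getCount() {
        return count;
    }
}
```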

asked Feb 07 '11 by npellow

2 Answers

The code should be here now: http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/core/src/java/org/apache/lucene/search/TotalHitCountCollector.java
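Usage is roughly like this (a minimal sketch, assuming an IndexSearcher and a Query are already set up elsewhere):

```java
import org.apache.lucene.search.TotalHitCountCollector;

// Count every match without scoring, sorting, or keeping doc ids.
TotalHitCountCollector collector = new TotalHitCountCollector();
searcher.search(query, collector);
int totalHits = collector.getTotalHits();
```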

answered by Robert Muir

You are right that #3 will be quicker, but I don't think it's because of scoring. There is a much faster way; skip to the bottom if you don't care about the reasoning behind this.

The performance loss of #1 comes from the fact that the TopDocs collector will keep the docs in a priority queue, which means that you will lose some time sorting them by score. (You will also eat up some memory, but since you're storing just a heap of int+float pairs, it's probably pretty minimal.)

As to why Lucene doesn't provide this out of the box: you generally don't want to find all results. That's why when you search, you say to only find the top n results. There are strong theoretical reasons for this. Even Google says "Showing 25 of about n results."

So my advice to you is the following: if you have a reasonable number of results, then using TopDocs.totalHits won't be too bad performance-wise. If the totalHits route gives you problems, I don't think that a custom collector will be much better. (Collecting the top k docs runs in roughly n log k time, and the custom collector is linear. Depending on your setup, the log k factor may be relevant, or it may not.)
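For concreteness, the totalHits route looks roughly like this (Lucene 3.x API; searcher and query assumed to exist):

```java
import org.apache.lucene.search.TopDocs;

// Keep only the top 10 docs in the priority queue;
// totalHits still counts every matching document.
TopDocs topDocs = searcher.search(query, 10);
int totalHits = topDocs.totalHits;
```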

So, if you absolutely need this functionality and TopDocs.totalHits is too slow, I would recommend looking at the document frequency of the search terms. You could assume the terms occur independently (so p(A and B) = p(A) * p(B)) and make a pretty good guess from there. It will be very fast, because it's just a constant-time lookup for each term.
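A minimal sketch of that estimate (estimateConjunctionHits is a hypothetical helper name, not a Lucene API; assumes the Lucene 3.x IndexReader):

```java
import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;

// Estimate hits for an AND query, assuming the terms are independent:
// p(A and B) = p(A) * p(B), where p(term) = docFreq(term) / numDocs.
static long estimateConjunctionHits(IndexReader reader, Term... terms)
        throws IOException {
    int n = reader.numDocs();
    if (n == 0) {
        return 0;
    }
    double p = 1.0;
    for (Term t : terms) {
        p *= (double) reader.docFreq(t) / n; // constant-time lookup per term
    }
    return Math.round(p * n);
}
```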

answered by Xodarap