
How to get total number of potential results in Lucene

Tags:

lucene.net

I'm using lucene on a site of mine and I want to show the total result count from a query, for example:

Showing results x to y of z

But I can't find any method that returns the total number of potential results. The only methods I can find require you to specify how many results you want, and since I only want 10 per page, it seems logical to pass in 10.

Or am I doing this wrong? Should I be passing in, say, 1000 and then just taking the 10 in the range that I need?

Aaron Powell asked Apr 06 '10 23:04


2 Answers

BTW, since I know you personally, I should point out for others that I already knew you were referring to Lucene.net and not Lucene :) although the API would be the same.

In versions prior to 2.9.x you could call IndexSearcher.Search(Query query, Filter filter), which returned a Hits object; one of its properties (methods, technically, because it is a Java port) was Length().

This is now marked Obsolete since it will be removed in 3.0; the Search overloads that return results now return TopDocs or TopFieldDocs objects.

Your alternatives are

a) Use IndexSearcher.Search(Query query, int count), which returns a TopDocs object. TopDocs.TotalHits then gives you the total number of possible hits, but at the expense of actually creating <count> results.

b) A faster way is to implement your own Collector object (inherit from Lucene.Net.Search.Collector) and call IndexSearcher.Search(Query query, Collector collector). The search method calls Collect(int docId) on your collector for every match, so if you keep track of those calls internally you have a way of counting all the results.
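To make option b) concrete, here is a minimal Java sketch of the counting idea. It deliberately does not use Lucene's Collector class: CountingCollector and ToySearcher are hypothetical stand-ins, and the toy "query" (every id divisible by step) is an assumption made purely for illustration. The point it shows is that the per-match callback only increments a counter and never loads a document.

```java
import java.util.function.IntConsumer;

// Hypothetical stand-in for a per-match callback; this is NOT Lucene's Collector.
final class CountingCollector implements IntConsumer {
    private int totalHits;

    // Called once per matching document id; it only counts, it never loads the document.
    @Override
    public void accept(int docId) {
        totalHits++;
    }

    int getTotalHits() {
        return totalHits;
    }
}

// Hypothetical toy searcher: treats every id divisible by `step` as a match.
final class ToySearcher {
    private ToySearcher() {}

    static void search(int maxDoc, int step, IntConsumer collector) {
        for (int docId = 0; docId < maxDoc; docId++) {
            if (docId % step == 0) {
                collector.accept(docId); // report the match, nothing else
            }
        }
    }
}
```

With a real index you would pass your own Collector subclass to the search call instead of ToySearcher, and read the counter back afterwards in the same way.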

It should be noted Lucene is not a total-resultset query environment and is designed to stream the most relevant results to you (the developer) as fast as possible. Any method which gives you a "total results" count is just a wrapper enumerating over all the matches (as with the Collector method).

The trick is to keep this enumeration as fast as possible. The most expensive part is deserialising Documents from the index, populating each field and so on. The newer API design, which requires you to write your own Collector, at least makes the principle clear: by default you are given only the matching document Ids and a score, which steers you away from deserialising each result from the index.

Alex Norcliffe answered Sep 22 '22 13:09


The top docs collector does this for you, for example:

TopDocs topDocs = searcher.search(qry, 10);
int totalHits = topDocs.totalHits;

The above query will count all hits, but return only 10.
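For the "Showing results x to y of z" line from the question, the total hit count combines with plain paging arithmetic. The sketch below is an assumption about how you might wire it up, not part of Lucene: PageWindow, hitsToRequest and describe are hypothetical names, and pages are assumed to be 1-based. The usual pattern is to request enough top hits to cover the requested page and skip the ones belonging to earlier pages.

```java
// Hypothetical helper, not part of Lucene: paging arithmetic around a total hit count.
final class PageWindow {
    private PageWindow() {}

    // How many top hits to request so that the given 1-based page is fully covered.
    static int hitsToRequest(int page, int pageSize) {
        return page * pageSize;
    }

    // Builds the status line from the 1-based page, the page size and the total hit count.
    static String describe(int page, int pageSize, int totalHits) {
        if (page < 1 || pageSize < 1 || totalHits < 0) {
            throw new IllegalArgumentException("page and pageSize must be >= 1, totalHits >= 0");
        }
        int first = (page - 1) * pageSize + 1;           // x: first result shown on this page
        if (first > totalHits) {
            return "No results";                          // page lies past the last hit
        }
        int last = Math.min(page * pageSize, totalHits);  // y: clamp to the total on the last page
        return "Showing results " + first + " to " + last + " of " + totalHits;
    }
}
```

For page 2 with 10 per page you would ask the searcher for hitsToRequest(2, 10) = 20 hits, display hits 11 to 20, and pass topDocs.totalHits as the last argument of describe.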

Mikos answered Sep 22 '22 13:09