While reading "Lucene in Action, 2nd edition" I came across the description of the Filter classes, which can be used for result filtering in Lucene. Lucene has many filters that mirror Query classes, for example NumericRangeQuery and NumericRangeFilter.
The book says that NRF does exactly the same as NRQ but without document scoring. Does this mean that if I do not need scoring, or if I sort documents by a field value, I should prefer Filters over Queries from a performance point of view?
I received a great answer from Uwe Schindler; let me repost it here.
If you don't cache filters, queries will be faster, as the ConjunctionScorer in Lucene has optimizations that are currently not used for Filters. Filters are fine if you cache them (e.g. if you always have the same access restrictions for a specific user that are applied to all his queries). In that case the Filter is executed only once, cached for all further requests, and then intersected with the query result set.
If you only want to "filter" ad hoc, e.g. by a variable numeric range like a bounding box in a geographic search, use queries; they are faster in most cases. (Range queries and similar ones, called MultiTermQueries, are internally implemented with the same BitSet algorithm as the Filter; in fact they are only Filters wrapped by a Scorer implementation.) But the Scorer that ANDs the query and your "filter" query together (ConjunctionScorer) is generally faster than the code that applies the filter after searching. Some improvement may be possible here, but in general filters are something in Lucene that is not really needed anymore, so there have already been some approaches to make Filters and Queries the same and then also be able to cache non-scoring queries. This would make a lot of code simpler.
Filters can bring a huge speed improvement with Lucene 4.0 if they are plugged on top of the IndexReader to filter the documents before scoring, but that's not yet implemented (see https://issues.apache.org/jira/browse/LUCENE-3212); I am working on it. We may also make Filters random access (it's easy, as they are bitsets), which could also improve the after-query filtering. But I would then also make Queries partially random access if they could support it (like queries that are only based on FieldCache).
Uwe
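To make the ConjunctionScorer point in the quote a bit more concrete, here is a minimal sketch of the general "leapfrog" idea behind ANDing two clauses: each side jumps ahead to the other side's current candidate instead of visiting every document. This is only an illustration with plain sorted int arrays. The class and method names (LeapfrogConjunction, advance, intersect) are made up for this example and this is not Lucene's actual code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only: not Lucene's ConjunctionScorer implementation.
class LeapfrogConjunction {

    // Returns the first index >= from whose doc id is >= target.
    // Binary search stands in for the skip data a real postings list would use.
    static int advance(int[] docs, int from, int target) {
        int pos = Arrays.binarySearch(docs, from, docs.length, target);
        return pos >= 0 ? pos : -pos - 1;
    }

    // Intersects two ascending doc-id lists; whichever side is behind
    // leaps forward to the other side's current doc id.
    static List<Integer> intersect(int[] a, int[] b) {
        List<Integer> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) {
                out.add(a[i]); // both clauses match this document
                i++;
                j++;
            } else if (a[i] < b[j]) {
                i = advance(a, i, b[j]);
            } else {
                j = advance(b, j, a[i]);
            }
        }
        return out;
    }
}
```

For example, intersecting the doc ids 1, 4, 7, 9, 20 with 2, 4, 9, 15, 20, 33 yields 4, 9, 20, and the sparser a clause is, the more documents the other side gets to skip.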
In contrast to Dennis' answer: no, you probably don't want to use a filter unless you're going to reuse the same query multiple times.
A NumericRangeFilter is just a subclass of MultiTermQueryWrapperFilter, which means that essentially it does something like this:
    for each document i in index:
        if document i matches query:
            match[i] = 1
        else:
            match[i] = 0
So it will run in linear time over your index instead of logarithmic time like a normal query.
Additionally, the filter will take up more memory (one bit for every doc in your index).
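As a rough illustration of the simplified model in the pseudocode above (not the real MultiTermQueryWrapperFilter code), here is a small Java sketch that fills a java.util.BitSet with one pass over all documents and estimates the one-bit-per-document memory cost. The names LinearScanFilter, buildBits and bitSetBytes are hypothetical, and documents are just the ints 0..maxDoc-1.

```java
import java.util.BitSet;
import java.util.function.IntPredicate;

// Hypothetical sketch of the pseudocode's model: documents are plain ints
// 0..maxDoc-1 and "matches" can be any predicate on the doc id.
class LinearScanFilter {

    // Visits every document exactly once and sets its bit if it matches.
    static BitSet buildBits(int maxDoc, IntPredicate matches) {
        BitSet bits = new BitSet(maxDoc);
        for (int doc = 0; doc < maxDoc; doc++) {
            if (matches.test(doc)) {
                bits.set(doc);
            }
        }
        return bits;
    }

    // One bit per document, rounded up to whole bytes.
    static long bitSetBytes(long maxDoc) {
        return (maxDoc + 7) / 8;
    }
}
```

Under this model an index with 10,000,000 documents needs about 1,250,000 bytes (roughly 1.2 MB) per filter, no matter how few documents actually match.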
If you're going to be using the same query over and over again, then it's probably worth it to you to pay the performance/memory hit once and have later usages be faster. But if it's a one-off query, it's almost certainly not worth it.
(Also, if you're going to reuse it, use a CachingWrapperFilter so that the filter is cached.)
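The pay-once, reuse-later trade-off can be sketched without Lucene at all. The snippet below is a hypothetical stand-in for the idea behind a caching wrapper, not the CachingWrapperFilter API itself: the class CachedBits, its string cache key and the computations counter are all assumptions made for illustration.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical stand-in for the caching idea; this is not Lucene code.
class CachedBits {
    private final Map<String, BitSet> cache = new HashMap<>();
    int computations = 0; // how many times the expensive build actually ran

    // Returns the cached bits for this key, building them only on first use.
    BitSet get(String key, Supplier<BitSet> expensiveBuild) {
        BitSet bits = cache.get(key);
        if (bits == null) {
            bits = expensiveBuild.get();
            computations++;
            cache.put(key, bits);
        }
        return bits;
    }

    // Restricts one query's hits to the filter; works on a copy so that
    // neither the hits nor the cached bits are modified.
    static BitSet restrict(BitSet queryHits, BitSet filterBits) {
        BitSet result = (BitSet) queryHits.clone();
        result.and(filterBits);
        return result;
    }
}
```

The first call to get for a key pays the full build cost. Every later call is just a map lookup followed by a cheap bitwise AND against that query's hits, which is why caching only pays off when the same restriction is applied repeatedly.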
If the filter will be reused, it is wise to use it instead of a query, because it can be cached. If you do not need scoring or field values, it also makes sense to use a filter rather than a query.
Hope this helps.
I found this at http://wiki.apache.org/lucene-java/ImproveSearchingSpeed, which seems to suggest using filters rather than queries. Intuitively this makes more sense to me, as they should do pretty much the same thing, the only difference being that filters do not contribute to the score.
Consider using filters. It can be much more efficient to restrict results to a part of the index using a cached bit set filter rather than using a query clause. This is especially true for restrictions that match a great number of documents of a large index. Filters are typically used to restrict the results to a category but could in many cases be used to replace any query clause. One difference between using a Query and a Filter is that the Query has an impact on the score while a Filter does not.