I have an index whose documents have two fields (actually more like 800 fields, but the others won't concern us here):
The contents field contains the analyzed/tokenized text of the document. The query string is searched for in this field.
The category field contains the single category identifier of the document. There are about 2500 different categories, and a document may occur in several of them (i.e. a document may have multiple category entries). The results are filtered by this field.
The index contains about 20 million documents and is 5 GB in size.
The index is queried with a user-provided query string, plus an optional set of a few categories the user is not interested in. The question is: how can I remove the documents that match the query string but also belong to one of the unwanted categories?
I could use a BooleanQuery with a MUST_NOT clause, i.e. something like this:
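BooleanQuery q = new BooleanQuery();
// The content query must match.
q.add(contentQuery, BooleanClause.Occur.MUST);
// Exclude every document that carries one of the unwanted categories.
for (String unwanted : unwantedCategories) {
    q.add(new TermQuery(new Term("category", unwanted)), BooleanClause.Occur.MUST_NOT);
}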
Is there a way to do this with Lucene filters? Performance is an issue here, and there will only be a few recurring variants of unwantedCategories, so a CachingWrapperFilter would probably help a lot. Also, due to the way the Lucene queries are generated in the existing code base, it is difficult to fit this in, whereas an extra Filter could be introduced easily.
In other words, how do I create a Filter based on what terms must _not_ occur in a document?
The short answer is that this is not possible with a standard Lucene query alone. Lucene does not allow a NOT clause as the only term of a query, for much the same reason that the classic query parser rejects leading-wildcard queries by default: to evaluate either, the engine would have to look through every document to determine whether it is or is not a hit.
Lucene supports fielded data. When performing a search you can either specify a field or use the default field. The field names and the default field are implementation-specific. You can search any field by typing the field name followed by a colon ":" and then the term you are looking for.
In a nutshell, when Lucene indexes a document, it breaks it down into a number of terms. It then stores the terms in an index file where each term is associated with the documents that contain it. You could think of it as a bit like a hashtable.
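To make the hashtable analogy concrete, here is a minimal sketch in plain Java. It does not use Lucene at all; the class name ToyInvertedIndex and its methods are made up for illustration, and real Lucene stores its postings per segment in a compressed form. The sketch maps each term to the set of document ids containing it, and shows that excluding a term amounts to subtracting its document set from a set of candidates.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

/** Toy inverted index (hypothetical, not Lucene): term -> ids of documents containing it. */
final class ToyInvertedIndex {
    private final Map<String, BitSet> postings = new HashMap<>();
    private int docCount = 0;

    /** Adds a document made of the given terms and returns its id. */
    int addDocument(String... terms) {
        int docId = docCount++;
        for (String term : terms) {
            postings.computeIfAbsent(term, t -> new BitSet()).set(docId);
        }
        return docId;
    }

    /** Returns a copy of the ids of documents containing the term (empty if the term is unknown). */
    BitSet docsWith(String term) {
        BitSet hits = postings.get(term);
        return hits == null ? new BitSet() : (BitSet) hits.clone();
    }

    /** Returns the ids of documents that contain none of the given terms. */
    BitSet docsWithout(String... unwantedTerms) {
        BitSet result = new BitSet(docCount);
        result.set(0, docCount);               // start from all documents
        for (String term : unwantedTerms) {
            result.andNot(docsWith(term));     // subtract each term's document set
        }
        return result;
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.addDocument("lucene", "cat:a");          // doc 0
        index.addDocument("lucene", "cat:b");          // doc 1
        index.addDocument("solr", "cat:a", "cat:b");   // doc 2

        BitSet hits = index.docsWith("lucene");        // docs 0 and 1
        hits.and(index.docsWithout("cat:b"));          // keep only matches outside cat:b
        System.out.println(hits);                      // prints {0}
    }
}
```

Note that docsWithout has to start from the full set of documents, which is why a purely negative condition is more natural as a filter over existing results than as a stand-alone query.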
One word answer: BooleanFilter. I found it minutes after formulating the question:
BooleanFilter f = new BooleanFilter();
// One MUST_NOT clause per unwanted category.
for (String unwanted : unwantedCategories) {
    TermsFilter tf = new TermsFilter(new Term("category", unwanted));
    f.add(new FilterClause(tf, BooleanClause.Occur.MUST_NOT));
}
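Since only a few variants of unwantedCategories recur, the set of allowed documents is worth computing once per variant and reusing. In Lucene that is the job of CachingWrapperFilter, provided the same CachingWrapperFilter instance is reused across searches. The sketch below only illustrates that caching idea in plain Java; it is not Lucene's implementation, and ExclusionCache, its constructor arguments and allowedDocs are hypothetical names chosen for this example.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical cache: per set of unwanted categories, the ids of documents that stay allowed. */
final class ExclusionCache {
    private final Map<String, BitSet> docsByCategory; // category -> ids of its documents
    private final int docCount;
    private final Map<Set<String>, BitSet> cache = new HashMap<>();
    private int computations = 0;

    ExclusionCache(Map<String, BitSet> docsByCategory, int docCount) {
        this.docsByCategory = docsByCategory;
        this.docCount = docCount;
    }

    /** Returns ids of documents in none of the unwanted categories, computed at most once per set. */
    BitSet allowedDocs(Set<String> unwanted) {
        Set<String> key = new HashSet<>(unwanted);   // defensive copy used as the cache key
        BitSet allowed = cache.get(key);
        if (allowed == null) {
            allowed = new BitSet(docCount);
            allowed.set(0, docCount);                // start from all documents
            for (String category : key) {
                BitSet docs = docsByCategory.get(category);
                if (docs != null) {
                    allowed.andNot(docs);            // drop documents in an unwanted category
                }
            }
            computations++;
            cache.put(key, allowed);
        }
        return (BitSet) allowed.clone();             // callers get their own copy
    }

    /** Number of cache misses so far, exposed only to make the caching visible. */
    int computations() {
        return computations;
    }

    public static void main(String[] args) {
        Map<String, BitSet> docsByCategory = new HashMap<>();
        docsByCategory.put("sports", BitSet.valueOf(new long[] {0b0011L}));   // docs 0 and 1
        docsByCategory.put("politics", BitSet.valueOf(new long[] {0b0100L})); // doc 2
        ExclusionCache cache = new ExclusionCache(docsByCategory, 4);

        Set<String> unwanted = new HashSet<>();
        unwanted.add("sports");
        System.out.println(cache.allowedDocs(unwanted)); // docs 2 and 3 remain
        System.out.println(cache.allowedDocs(unwanted)); // same result, served from the cache
        System.out.println(cache.computations());        // one computation for two lookups
    }
}
```

The cache key is the set of unwanted categories itself, so two requests with the same categories in a different order share one entry. With Lucene, the equivalent would be to keep one cached filter per recurring variant rather than building a new BooleanFilter for every search.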