
Lucene: Filtering for documents NOT containing a Term

I have an index whose documents have two fields (actually more like 800 fields but the other fields won't concern us here):

  • The contents field contains the analyzed/tokenized text of the document. The query string is searched for in this field.
  • The category field contains a single category identifier. There are about 2500 different categories, and a document may occur in several of them (i.e. a document may have multiple category entries). The results are filtered by this field.

The index contains about 20 million documents and is 5 GB in size.

The index is queried with a user-provided query string, plus an optional set of a few categories the user is not interested in. The question is: how can I remove from the results those documents that match the query string but also belong to one of the unwanted categories?

I could use a BooleanQuery with a MUST_NOT clause, i.e. something like this:

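// contentQuery is the parsed user query; unwantedCategories holds the category ids to exclude
BooleanQuery q = new BooleanQuery();
q.add(contentQuery, BooleanClause.Occur.MUST);
for (String unwanted : unwantedCategories) {
    q.add(new TermQuery(new Term("category", unwanted)), BooleanClause.Occur.MUST_NOT);
}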

Is there a way to do this with Lucene filters? Performance is an issue here, and there will only be a few, recurring, variants of unwantedCategories, so a CachingWrapperFilter would probably help a lot. Also, due to the way the Lucene queries are generated in the existing code base, it is difficult to fit this in, whereas an extra Filter could be introduced easily.
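
To illustrate why caching pays off when only a few variants of unwantedCategories recur, here is a toy sketch in plain Java. It is not Lucene's CachingWrapperFilter (which, as far as I understand, caches per index reader rather than per category set); the class name ExclusionCacheSketch, the compute function and the computations counter are all hypothetical and exist only for this illustration.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Function;

/** Toy cache of "allowed document" bit sets, keyed by the set of unwanted categories. */
final class ExclusionCacheSketch {
    private final Map<Set<String>, BitSet> cache = new HashMap<>();
    private final Function<Set<String>, BitSet> compute; // hypothetical expensive index scan
    int computations = 0;                                // number of cache misses so far

    ExclusionCacheSketch(Function<Set<String>, BitSet> compute) {
        this.compute = compute;
    }

    /** Returns the allowed documents, running the expensive computation only on a cache miss. */
    BitSet allowedDocs(Set<String> unwanted) {
        Set<String> key = new TreeSet<>(unwanted);       // defensive copy used as the cache key
        BitSet cached = cache.get(key);
        if (cached == null) {
            computations++;
            cached = compute.apply(key);
            cache.put(key, cached);
        }
        return (BitSet) cached.clone();                  // callers get their own copy
    }

    public static void main(String[] args) {
        // Stand-in computation: pretend documents 0..4 are all allowed.
        ExclusionCacheSketch sketch = new ExclusionCacheSketch(unwanted -> {
            BitSet all = new BitSet();
            all.set(0, 5);
            return all;
        });
        sketch.allowedDocs(new TreeSet<>(Arrays.asList("sports", "politics")));
        sketch.allowedDocs(new TreeSet<>(Arrays.asList("politics", "sports")));
        System.out.println(sketch.computations);         // the second call is served from the cache
    }
}
```

The two requests name the same categories in a different order, so only the first one triggers the computation.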

In other words, how do I create a Filter based on what terms must _not_ occur in a document?

digitalarbeiter asked Dec 20 '10 11:12



1 Answer

One-word answer: BooleanFilter. I found it minutes after formulating the question:

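// Note: the TermsFilter(Term...) constructor may depend on the Lucene version;
// older releases may need new TermsFilter() followed by addTerm(...) instead.
BooleanFilter f = new BooleanFilter();
for (String unwanted : unwantedCategories) {
    TermsFilter tf = new TermsFilter(new Term("category", unwanted));
    f.add(new FilterClause(tf, BooleanClause.Occur.MUST_NOT));
}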
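
What a filter made only of MUST_NOT clauses has to compute can be pictured with a small, Lucene-free sketch: start with every document allowed, then clear the documents that carry an unwanted category. This is a toy model under stated assumptions, not Lucene's actual implementation; the class ExclusionFilterSketch and the postings map (category to document ids) are hypothetical stand-ins for the index.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of a MUST_NOT-only filter; not Lucene's actual implementation. */
final class ExclusionFilterSketch {

    /**
     * Returns the ids of the documents that carry none of the unwanted categories.
     * postings is a hypothetical stand-in for the index: category -> ids of documents tagged with it.
     */
    static BitSet allowedDocs(int maxDoc, Map<String, BitSet> postings, List<String> unwanted) {
        BitSet allowed = new BitSet(maxDoc);
        allowed.set(0, maxDoc);                  // start with every document allowed
        for (String category : unwanted) {
            BitSet docs = postings.get(category);
            if (docs != null) {
                allowed.andNot(docs);            // clear documents tagged with this category
            }
        }
        return allowed;
    }

    public static void main(String[] args) {
        BitSet sports = new BitSet();
        sports.set(0);
        sports.set(2);
        BitSet politics = new BitSet();
        politics.set(2);
        politics.set(3);
        Map<String, BitSet> postings = new HashMap<>();
        postings.put("sports", sports);
        postings.put("politics", politics);

        // Documents 0 and 2 are tagged "sports", so they are removed from 0..4.
        System.out.println(allowedDocs(5, postings, Arrays.asList("sports"))); // prints {1, 3, 4}
    }
}
```

The resulting bit set depends only on the unwanted categories, not on the query string, which is why it is a good candidate for caching.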
digitalarbeiter answered Sep 22 '22 00:09