Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Howto perform a 'contains' search rather than 'starts with' using Lucene.Net

We use Lucene.NET to implement a full text search on a clients website. The search itself works already but we now want to implement a modification.

Currently all terms get appended a * which leads Lucene to perform what I would classify as a StartsWith search.

In the future we would like to have a search that performs something like a Contains rather than a StartsWith.

We use

  • Lucene.Net 2.9.2.2
  • StandardAnalyzer
  • default QueryParser

Samples:

(Title:Orch*) matches: Orchestra

but:

(Title:rch*) does not match: Orchestra

We want the first and the second one to both match Orchestra.

Basically I want the exact opposite of what was asked in this question, I'm not sure why for this person Lucene performed a Contains and rather than a StartsWith by default:
Why is this Lucene query a "contains" instead of a "startsWith"?

How can we make this happen?
I have the feeling it has something to do with the Analyzer but I'm not sure.

like image 551
ntziolis Avatar asked Mar 30 '11 10:03

ntziolis


2 Answers

First off, I assume you're using StandardAnalyzer, or something similar. Your linked question fail to understand that you search for terms, and his case a* will match "Fleet Africa" because it's tokenized into "fleet" and "africa".

You need to call QueryParser.SetAllowLeadingWildcard(true) to be able to write queries like field:*value*. Are you actually changing the string that's passed to QueryParser?

You could parse the query as usual, and then implement a QueryVisitor that rewrites all TermQuery into WildcardQuery. That way you still support phrase searches.

I see no good things in rewriting queries into prefix- or wildcard-queries. There is very little shared between an orc, or a chest, and an Orchestra, but both words will match. Instead, hook up your customer with an analyzer that supports stemming, synonyms, and provide a spell correction feature to fix simple searching mistakes.

like image 156
sisve Avatar answered Nov 02 '22 02:11

sisve


@Simon Svensson probably gave the better answer (i.e. you don't need this), but if you do, you should use a Shingle Filter.

Note that this will make your index massively larger, since instead of just storing "orchestra", you will store "orc", "rch", "che", "hes"... But just having a plain term query with leading wildcards will be massively slow. It will essentially have to look through every single term in your corpus.

like image 42
Xodarap Avatar answered Nov 02 '22 02:11

Xodarap