We use Lucene.NET to implement a full text search on a clients website. The search itself works already but we now want to implement a modification.
Currently all terms get appended a *
which leads Lucene to perform what I would classify as a StartsWith
search.
In the future we would like to have a search that performs something like a Contains
rather than a StartsWith
.
We use
Samples:
(Title:Orch*)
matches: Orchestra
but:
(Title:rch*)
does not match: Orchestra
We want the first and the second one to both match Orchestra
.
Basically I want the exact opposite of what was asked in this question, I'm not sure why for this person Lucene performed a Contains
and rather than a StartsWith
by default:
Why is this Lucene query a "contains" instead of a "startsWith"?
How can we make this happen?
I have the feeling it has something to do with the Analyzer but I'm not sure.
First off, I assume you're using StandardAnalyzer, or something similar. Your linked question fail to understand that you search for terms, and his case a*
will match "Fleet Africa" because it's tokenized into "fleet" and "africa".
You need to call QueryParser.SetAllowLeadingWildcard(true)
to be able to write queries like field:*value*
. Are you actually changing the string that's passed to QueryParser?
You could parse the query as usual, and then implement a QueryVisitor that rewrites all TermQuery
into WildcardQuery
. That way you still support phrase searches.
I see no good things in rewriting queries into prefix- or wildcard-queries. There is very little shared between an orc, or a chest, and an Orchestra, but both words will match. Instead, hook up your customer with an analyzer that supports stemming, synonyms, and provide a spell correction feature to fix simple searching mistakes.
@Simon Svensson probably gave the better answer (i.e. you don't need this), but if you do, you should use a Shingle Filter.
Note that this will make your index massively larger, since instead of just storing "orchestra", you will store "orc", "rch", "che", "hes"... But just having a plain term query with leading wildcards will be massively slow. It will essentially have to look through every single term in your corpus.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With