I've got an unusual situation. Normally when you search a text index you are searching for a small number of keywords against documents with a larger number of terms.
For example you might search for "quick brown" and expect to match "the quick brown fox jumps over the lazy dog".
My situation is the reverse: I have lots of small phrases in my document store, and I want to match them against a larger query phrase.
For example if I have a query:
and the documents
I'd like to find the documents that have a phrase that occurs in the query. In this case "quick brown" and "lazy dog" (but not "fox over" because although the tokens match it's not a phrase in the search string).
Is this sort of query possible with Solr/Lucene?
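To make the behaviour I'm after concrete, here is a rough sketch in plain Python (not Solr). The query string and the three documents are assumed examples built from the phrases mentioned above, and `matching_documents` is just a name I made up for illustration:

```python
def tokenize(text):
    # Naive whitespace tokenizer; a real analyzer would do much more.
    return text.lower().split()


def contains_phrase(query_tokens, phrase_tokens):
    # True if phrase_tokens occurs as a contiguous run inside query_tokens.
    n = len(phrase_tokens)
    if n == 0:
        return False
    return any(
        query_tokens[i:i + n] == phrase_tokens
        for i in range(len(query_tokens) - n + 1)
    )


def matching_documents(query, documents):
    # Keep only the documents whose whole text is a phrase in the query.
    query_tokens = tokenize(query)
    return [d for d in documents if contains_phrase(query_tokens, tokenize(d))]


# Assumed example data, not taken from a real index.
query = "the quick brown fox jumps over the lazy dog"
documents = ["quick brown", "fox over", "lazy dog"]
matches = matching_documents(query, documents)
print(matches)
```

Here "fox over" is rejected because "fox" and "over" are not adjacent in the query, which is exactly the behaviour I'd like from the index.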
It sounds like you want to use ShingleFilter in your analysis, so that you index word bigrams. To do that, add ShingleFilterFactory at both query and index time.
At index time your documents are then indexed as such:
At query time your query becomes:
This is still no good: by default the query parser will form a phrase query from these shingles. So, in your query analyzer only, add PositionFilterFactory after the ShingleFilterFactory. This "flattens" the positions in the query so that the query parser treats the output as synonyms, which will yield a BooleanQuery with these sub-queries (all SHOULD clauses, so it's basically an OR query):
BooleanQuery:
This should be the most performant way, since it's then really just a BooleanQuery of TermQuery clauses.
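The idea can be simulated outside Solr with a small Python sketch. It is only an illustration of the technique, not the Lucene implementation: the `shingles`, `build_index` and `search` helpers are hypothetical, the example documents and query are assumed, and only bigrams are emitted (no unigrams).

```python
def shingles(text, size=2):
    # Word n-grams ("shingles") of the given size, as a set of strings.
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)}


def build_index(documents):
    # Inverted index: shingle -> set of ids of documents containing it.
    index = {}
    for doc_id, text in enumerate(documents):
        for term in shingles(text):
            index.setdefault(term, set()).add(doc_id)
    return index


def search(index, query):
    # OR together one term lookup per query shingle (all SHOULD clauses).
    hits = set()
    for term in shingles(query):
        hits |= index.get(term, set())
    return sorted(hits)


# Assumed example data.
documents = ["quick brown", "fox over", "lazy dog"]
index = build_index(documents)
hit_ids = search(index, "the quick brown fox jumps over the lazy dog")
hit_docs = [documents[i] for i in hit_ids]
print(hit_docs)
```

"fox over" never matches because the query only produces the shingles "fox jumps" and "jumps over". Note that in this sketch a longer document would match as soon as any one of its bigrams appears in the query.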
Sounds like you want the DisMax "minimum match" parameter. I wrote a blog article on the concept a little while ago: http://blog.websolr.com/post/1299174416. There's also the Solr wiki on minimum match.
The "minimum match" concept applies to the "optional" terms in your query: the terms that aren't explicitly marked with +/- as "+mandatory" or "-prohibited". By default, the minimum match is 100%, meaning that 100% of the optional terms must be present. In other words, all of your terms are considered mandatory.
This is why your longer query isn't currently matching documents containing shorter fragments of that phrase. The other keywords in the longer search phrase are treated as mandatory.
If you drop the minimum match down to 1, then only one of your optional terms needs to match. In some ways this is the opposite of the default of 100%. It's like your query of quick brown fox… is turned into quick OR brown OR fox OR … and so on.
If you set your minimum match to 2, then your search phrase effectively gets broken up into groups of two terms. A search for quick brown fox turns into (quick brown) OR (brown fox) OR (quick fox) … and so on. (Excuse my pseudo-query there, I trust you see the point.)
The minimum match parameter also supports percentages (say, 20%) and some even more complex expressions. So there's a fair amount of tweakability.
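As a rough illustration of how the threshold behaves, here is a small Python simulation. It is a sketch under assumptions rather than what DisMax does internally: `required_matches` and `matches` are hypothetical helpers, duplicate query words are collapsed into a set, percentages are assumed to round down, and the example data (including the extra "lazy cat" document) is made up.

```python
def required_matches(mm, optional_count):
    # Turn an mm value (an int, or a string like "20%") into a term count.
    if isinstance(mm, str) and mm.endswith("%"):
        # Assumption: percentages are rounded down.
        return optional_count * int(mm[:-1]) // 100
    return min(int(mm), optional_count)


def matches(query, document, mm):
    # A document matches if enough of the optional query terms occur in it.
    query_terms = set(query.lower().split())
    doc_terms = set(document.lower().split())
    needed = required_matches(mm, len(query_terms))
    return len(query_terms & doc_terms) >= needed


# Assumed example data.
query = "the quick brown fox jumps over the lazy dog"
documents = ["quick brown", "fox over", "lazy dog", "lazy cat"]

for mm in ("100%", 1, 2):
    print(mm, [d for d in documents if matches(query, d, mm)])
```

With 100% none of the short documents match, with 1 all four do, and with 2 "lazy cat" drops out because only one of its terms is in the query. The check counts terms only and ignores word order and adjacency.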