I use the following filter in my schema.xml:
<filter class="solr.EdgeNGramFilterFactory" minGramSize="4" maxGramSize="15" side="front"/>
How can I boost the longer ngrams? For example, when I search for "bookpage", a document which contains "bookpage" should be rated a lot higher than a document with only "book".
I don't know of a way to dynamically boost based on term length (for example, with a function query). I suspect there isn't one.
That said, I often want to approximate the logic you're looking for: longer term matches deserve a higher semantic weight.
Most commonly, I will index the text value into two different fields. One is a minimally-processed text field without ngrams. The other is similar, but also processed with ngrams.
Here are some sample excerpts of a schema that I have used in this fashion. For searches against this schema, I would boost the text field heavily over the text_ngram field. Thus, any matches against text would greatly influence the relevancy, while matches against text_ngram can still pick up perhaps-relevant results as well.
<?xml version="1.0" encoding="UTF-8"?>
<schema name="Sunspot Customized NZ" version="1.0">
  <types>
    <!--
      A text type with minimal text processing, for the greatest semantic
      value in a term match. Boost this field heavily.
    -->
    <fieldType name="text" class="solr.TextField" omitNorms="false">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory" />
        <filter class="solr.StandardFilterFactory" />
        <filter class="solr.LowerCaseFilterFactory" />
      </analyzer>
    </fieldType>

    <!--
      Looser matches with NGram processing for substrings of terms and synonyms
    -->
    <fieldType name="text_ngram" class="solr.TextField" omitNorms="false">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory" />
        <filter class="solr.StandardFilterFactory" />
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" />
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="6" side="front" />
      </analyzer>
    </fieldType>

    <!-- other stuff -->
  </types>

  <fields>
    <!-- id, other scalar values -->

    <!-- catch-all fields for the text and text_ngram types -->
    <field name="text" stored="false" type="text" multiValued="true" indexed="true" />
    <field name="text_ngram" stored="false" type="text_ngram" multiValued="true" indexed="true" />

    <!-- various dynamicField definitions -->

    <!-- sample dynamicField definitions for text and text_ngram -->
    <dynamicField name="*_text" type="text" indexed="true" stored="false" multiValued="false" />
    <dynamicField name="*_text_ngram" type="text_ngram" indexed="true" stored="false" multiValued="false" />
  </fields>

  <!-- copy text fields into my text and text_ngram catch-all fields -->
  <copyField source="*_text" dest="text" />
  <copyField source="*_text" dest="text_ngram" />
</schema>
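At query time, the boost itself comes from weighting the fields against each other with the dismax/edismax query parser. A minimal sketch of such a handler in solrconfig.xml (the boost values here are illustrative, not tuned):

<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <!-- matches in the lightly-processed field count far more than ngram matches -->
    <str name="qf">text^10.0 text_ngram^0.5</str>
  </lst>
</requestHandler>

The same weighting can also be passed directly as request parameters, e.g. q=bookpage&defType=edismax&qf=text^10.0+text_ngram^0.5.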
This isn't exactly what you're looking for, but you could use a similar approach.
For example, create a small collection of intermediate NGram-processed field types -- say, covering gram lengths 1-3, 4-6, and 7-9 -- and give them increasing boosts accordingly, as sketched below.
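A rough sketch of that idea, following the same pattern as the schema above (the field names, gram ranges, and boost values are illustrative):

<!-- repeat this type with minGramSize/maxGramSize of 4/6 and 7/9 for,
     say, text_ngram_medium and text_ngram_long field types -->
<fieldType name="text_ngram_short" class="solr.TextField" omitNorms="false">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="3" side="front" />
  </analyzer>
</fieldType>

You would then copyField the same source text into each of those fields and weight the longer-gram fields higher at query time, for example qf=text^10.0 text_ngram_long^3.0 text_ngram_medium^1.5 text_ngram_short^0.5.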