 

How to boost longer ngrams in Solr?

Tags: search, solr

I use the following filter in my schema.xml:

<filter class="solr.EdgeNGramFilterFactory" minGramSize="4" maxGramSize="15" side="front"/>

How can I boost the longer ngrams? For example, when I search for "bookpage", a document that contains "bookpage" should score much higher than a document that contains only "book".

ndee asked Nov 15 '11 13:11



1 Answer

I don't know of a way to dynamically boost based on term length (i.e., with a Function Query operator). I suspect there isn't one.

That said, I often want to approximate the logic you're looking for: longer term matches deserve a higher semantic weight.

Most commonly, I will index the text value into two different fields. One is a minimally-processed text field without ngrams. The other is similar, but also processed with ngrams.

Here are some sample excerpts of a schema that I have used in this fashion. For searches against this schema, I would boost the text field heavily over the text_ngram field. Matches against text then dominate the relevancy score, while matches against text_ngram can still pick up results that are only perhaps relevant.

<?xml version="1.0" encoding="UTF-8"?>
<schema name="Sunspot Customized NZ" version="1.0">
  <types>

    <!--
      A text type with minimal text processing, for the greatest semantic
      value in a term match. Boost this field heavily.
    -->
    <fieldType name="text" class="solr.TextField" omitNorms="false">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory" />
        <filter class="solr.StandardFilterFactory" />
        <filter class="solr.LowerCaseFilterFactory" />
      </analyzer>
    </fieldType>

    <!--
      Looser matches with NGram processing for substrings of terms and synonyms
    -->
    <fieldType name="text_ngram" class="solr.TextField" omitNorms="false">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory" />
        <filter class="solr.StandardFilterFactory" />
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" />
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="6" side="front" />
      </analyzer>
    </fieldType>

    <!-- other stuff -->

  </types>
  <fields>

    <!-- id, other scalar values -->

    <!-- catch-all for the text and text_ngram types -->
    <field name="text"       stored="false" type="text"        multiValued="true"  indexed="true" />
    <field name="text_ngram" stored="false" type="text_ngram"  multiValued="true"  indexed="true" />

    <!-- various dynamicField definitions -->

    <!-- sample dynamicField definitions for text and text_ngram -->
    <dynamicField name="*_text"   type="text" indexed="true" stored="false" multiValued="false" />
    <dynamicField name="*_text_ngram"   type="text_ngram" indexed="true" stored="false" multiValued="false" />

  </fields>

  <!-- copy text fields into my text and text_ngram catch-all fields -->
  <copyField source="*_text"  dest="text" />
  <copyField source="*_text"  dest="text_ngram" />

</schema>
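
One way to express that boost, as a rough sketch: set per-field weights as defaults on a dismax request handler in solrconfig.xml. The handler name and the boost values below are assumptions for illustration only, not taken from the schema above; tune them against your own data.

```xml
<!-- solrconfig.xml excerpt; handler name and boost values are illustrative assumptions -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- dismax scores one query across several fields, each with its own weight -->
    <str name="defType">dismax</str>
    <!-- weight matches in text 10x relative to text_ngram (default weight 1) -->
    <str name="qf">text^10 text_ngram</str>
  </lst>
</requestHandler>
```

The same weights can also be passed per request as the qf parameter instead of being set as handler defaults.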

This isn't exactly what you're looking for, but you could use a similar approach.

For example, create a small collection of intermediate NGram-processed field types -- say, with gram lengths of 1-3, 4-6, and 7-9 -- and give them progressively higher boosts.
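
A minimal sketch of that tiered idea follows. The type names, field names, and boost values are hypothetical, and the copyField lines assume a *_text dynamicField like the one in the schema above. It leaves out StandardFilterFactory and the side attribute to keep the excerpt short; check both against your Solr version.

```xml
<!-- schema.xml excerpt; all names and gram ranges here are illustrative assumptions -->
<schema name="tiered-ngram-sketch" version="1.0">
  <types>
    <!-- short prefixes (1-3 chars): loosest matches, lowest boost -->
    <fieldType name="text_ngram_short" class="solr.TextField" omitNorms="false">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory" />
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="3" />
      </analyzer>
    </fieldType>

    <!-- medium prefixes (4-6 chars) -->
    <fieldType name="text_ngram_medium" class="solr.TextField" omitNorms="false">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory" />
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="4" maxGramSize="6" />
      </analyzer>
    </fieldType>

    <!-- long prefixes (7-9 chars): tightest matches, highest boost -->
    <fieldType name="text_ngram_long" class="solr.TextField" omitNorms="false">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory" />
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="7" maxGramSize="9" />
      </analyzer>
    </fieldType>
  </types>
  <fields>
    <!-- one catch-all field per tier -->
    <field name="text_ngram_short"  type="text_ngram_short"  indexed="true" stored="false" multiValued="true" />
    <field name="text_ngram_medium" type="text_ngram_medium" indexed="true" stored="false" multiValued="true" />
    <field name="text_ngram_long"   type="text_ngram_long"   indexed="true" stored="false" multiValued="true" />
  </fields>

  <!-- feed the same source text into every tier -->
  <copyField source="*_text" dest="text_ngram_short" />
  <copyField source="*_text" dest="text_ngram_medium" />
  <copyField source="*_text" dest="text_ngram_long" />

  <!-- example dismax weights (assumed values): qf=text_ngram_long^8 text_ngram_medium^4 text_ngram_short -->
</schema>
```

With these ranges, a term such as "book" is too short to produce any gram in the long tier, so only documents containing a longer term like "bookpage" can match a "bookpage" query there and earn the highest boost.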

Nick Zadrozny answered Sep 28 '22 13:09
