
Build Solr suggestions based on sentences instead of the entire field value

I have a Solr instance with a suggester component. It works fine, using the AnalyzingInfixLookupFactory implementation.

However, I want to expand the suggestions to a content field, which can contain a lot of text. The suggester finds suggestions all right, but it returns the entire field value, instead of just a sentence, or part of a sentence.

So, if I want a suggestion for "foo", and the content field contains a text like:

"I really like pizza. And donuts. Let's get some from that other place. The foo bar place."

The suggestion will be that entire text, instead of just "The foo bar place". And, obviously, when the content is hundreds of words long, this is just not usable.

Is there a way to limit the number of returned words for a suggestion?

Here's my search component:

<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">autocomplete</str>
    <str name="lookupImpl">AnalyzingInfixLookupFactory</str>
    <str name="indexPath">suggestions</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <str name="field">suggest</str>
    <str name="suggestAnalyzerFieldType">text_suggest</str>
    <str name="buildOnStartup">false</str>
    <bool name="highlight">false</bool>
    <str name="payloadField">label</str>
  </lst>
</searchComponent>

And here's the request handler:

<requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="suggest">true</str>
    <str name="suggest.dictionary">autocomplete</str>
    <str name="suggest.count">10</str>
  </lst>
  <arr name="components">
    <str>suggest</str>
  </arr>
</requestHandler>

Finally, here is the field from which the suggestions are derived:

<fieldType name="text_suggest" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="suggest" type="text_suggest" indexed="true" multiValued="true" stored="true"/>

I then use a bunch of <copyField>s to copy the content over.

EDIT 2015-08-28

The content field definition is as follows:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.MappingCharFilterFactory" mapping="txt/mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="txt/stopwords.txt" />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0" splitOnNumerics="0" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="25"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.MappingCharFilterFactory" mapping="txt/mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="content" type="text" indexed="true" stored="true" termVectors="true"/>

EDIT 2016-09-28

This issue is probably related: Is Solr SuggestComponent able to return shingles instead of whole field values?

asked Aug 14 '15 at 15:08 by wadmiraal


1 Answer

I think what you are looking for is solr.ShingleFilterFactory, which lets you limit the token size by word count, rather than by character length as with the solr.NGramFilterFactory you have been using.
See the Solr wiki page for more details:
https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ShingleFilterFactory
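
As a rough sketch of what that could look like, here is a variant of the text_suggest field type from the question with a shingle filter added. The name text_suggest_shingle and the shingle sizes (2 to 4 words) are assumptions picked for illustration, not values taken from the question, and I have not tested this against the setup above.

```xml
<!-- Hypothetical field type: the name and the shingle sizes are illustrative only -->
<fieldType name="text_suggest_shingle" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Emit word-based n-grams of 2 to 4 tokens; outputUnigrams="true" also keeps the single words -->
    <filter class="solr.ShingleFilterFactory" minShingleSize="2" maxShingleSize="4" outputUnigrams="true"/>
  </analyzer>
</fieldType>
```

Keep in mind that this only changes how the text is analyzed. As far as I know, DocumentDictionaryFactory builds its suggestions from the stored field values, so the shingle analysis alone may not shorten what the suggester returns; a dictionary implementation that reads the indexed terms may be needed as well.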

answered Oct 18 '22 at 15:10 by llesiuk