
Google like autosuggest / typeahead (suggesting keywords / phrases) with Solr

Requirements

I need Google-like suggestions in a search box. Solr is already a given. The results should look like this:

searchterm Alex
results Alexander Behling, Alexander Someone ...

searchterm cab
results cable, high voltage cable, cable cutter

The aim is to have phrases as suggestions, not entire fields or excerpts. The query should be case-insensitive (Alex should return the same results as alex), but the suggestions must keep their original case.

The suggestions must be filterable by category: we have the results of several domains in one index, and the suggestions should be filtered by a specific field containing the domain. Note that contextField is restricted; the documentation says: "AnalyzingInfixLookupFactory and BlendedInfixLookupFactory currently support this feature, when backed by DocumentDictionaryFactory."

I tried three approaches

1. Approach : FreeTextLookupFactory

config (no special schema changes): 
     <searchComponent name="suggest" class="solr.SuggestComponent">
        <lst name="suggester">
          <str name="name">default</str>
          <str name="lookupImpl">FreeTextLookupFactory</str> 
          <str name="dictionaryImpl">DocumentDictionaryFactory</str>
          <str name="field">content</str>
          <str name="ngrams">3</str>
          <str name="separator"> </str>
          <str name="suggestFreeTextAnalyzerFieldType">text_general</str>
        </lst>
    </searchComponent>

    <requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy" >
      <lst name="defaults">
        <str name="suggest">true</str>
        <str name="suggest.count">10</str>
        <str name="suggest.dictionary">default</str>        
        <str name="echoParams">explicit</str>
      </lst>
      <arr name="components">
         <str>suggest</str>
      </arr>
    </requestHandler>

This works reasonably well, but delivers only single words:
searchterm Alex
results Alexander, Alexandra ...
The advantage is a very high indexing speed. I tried to combine this with a ShingleFilter, but that didn't work, probably because a ShingleFilter is already part of the FreeTextLookupFactory. Because FreeTextLookupFactory does not support contextField, categories are not available with this approach.

2. Approach : BlendedInfixLookupFactory with custom field

schema:
<field name="suggest_field" type="text_suggest" indexed="true" stored="true" multiValued="true"/>
<field name="site" type="string" stored="true" indexed="true"/>
<copyField source="content" dest="suggest_field"/>

    <fieldType name="text_suggest" class="solr.TextField" positionIncrementGap="100">
            <analyzer type="index">
                <tokenizer class="solr.StandardTokenizerFactory"/>
                <!--filter class="solr.LowerCaseFilterFactory"/-->
                <filter class="solr.TrimFilterFactory"/>
                <filter class="solr.ShingleFilterFactory" 
                    minShingleSize="2"
                    maxShingleSize="4"
                    outputUnigrams="true"
                    outputUnigramsIfNoShingles="true"/>
            </analyzer>
            <analyzer type="query">
                <tokenizer class="solr.KeywordTokenizerFactory"/>
                <filter class="solr.LowerCaseFilterFactory"/>
                <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
           </analyzer>
    </fieldType>

config:
<searchComponent name="suggest" class="solr.SuggestComponent">
   <lst name="suggester">
      <str name="name">default</str>
      <str name="lookupImpl">BlendedInfixLookupFactory</str>
      <str name="blenderType">position_linear</str>
      <str name="dictionaryImpl">DocumentDictionaryFactory</str>
      <str name="field">suggest_field</str>
      <str name="weightField">weight</str>
      <str name="suggestAnalyzerFieldType">text_suggest</str>
      <str name="queryAnalyzerFieldType">phrase_suggest</str>
      <str name="indexPath">suggest</str>
      <str name="buildOnStartup">false</str>
      <str name="buildOnCommit">false</str>
      <bool name="exactMatchFirst">true</bool>
      <str name="contextField">site</str>
   </lst> 
</searchComponent>

    <requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy" >
      <lst name="defaults">
        <str name="suggest">true</str>
        <str name="suggest.count">10</str>
        <str name="suggest.dictionary">default</str>        
        <str name="echoParams">explicit</str>
      </lst>
      <arr name="components">
         <str>suggest</str>
      </arr>
    </requestHandler>

The second approach leads to behaviour that I find strange:

searchterm Alex or alex
results nothing ...
searchterm cab
results cable, cables, voltage cables, cable accessories, power cables ...

Using the same fields, some queries return no suggestions at all. Indexing already takes more than 12 hours for fewer than 10,000 entries. With BlendedInfixLookupFactory and DocumentDictionaryFactory, categories should be supported, but when I add a category to the query, e.g. http://localhost:8983/solr/magnolia/suggest?wt=json&suggest=true&suggest.q=nym&suggest.cfq=com, the results are empty. The field "site" does contain the value "com" multiple times in the index.

3. Approach : BlendedInfixLookupFactory with HighFrequencyDictionaryFactory and custom field

schema:

 <field name="suggest_field" type="text_shingle" indexed="true" stored="true" multiValued="true"/>
...
<copyField source="_text_" dest="suggest_field"/>
...
    <fieldType name="text_shingle" class="solr.TextField" positionIncrementGap="100">
        <analyzer type="index">
           <charFilter class="solr.HTMLStripCharFilterFactory"/>
           <tokenizer class="solr.StandardTokenizerFactory"/>
           <filter class="solr.TrimFilterFactory"/>
           <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_suggestions.txt" format="snowball" />
           <!--filter class="solr.EdgeNGramFilterFactory" minGramSize="4" maxGramSize="15"/-->
           <filter class="solr.ShingleFilterFactory" minShingleSize="2" maxShingleSize="4" outputUnigrams="false" outputUnigramsIfNoShingles="true" fillerToken=""/>
        </analyzer>
    </fieldType>
    <!-- marc johnen : used for autocomplete-->
    <fieldType name="text_suggest" class="solr.TextField" positionIncrementGap="100">
          <analyzer>
             <tokenizer class="solr.StandardTokenizerFactory"/>
             <filter class="solr.LowerCaseFilterFactory"/>
             <filter class="solr.TrimFilterFactory"/>
          </analyzer>
    </fieldType>

config:
    <searchComponent name="suggest" class="solr.SuggestComponent">
      <lst name="suggester">
        <str name="name">default</str>
        <str name="lookupImpl">BlendedInfixLookupFactory</str>
        <str name="dictionaryImpl">HighFrequencyDictionaryFactory</str>
        <str name="field">suggest_field</str>
        <str name="suggestAnalyzerFieldType">text_suggest</str>
        <str name="minPrefixChars">2</str>
        <str name="exactMatchFirst">true</str>
        <str name="buildOnStartup">false</str> 
        <str name="buildOnCommit">true</str>
        <str name="highlight">false</str>
      </lst>
    </searchComponent>

    <requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy" >
      <lst name="defaults">
        <str name="suggest">true</str>
        <str name="suggest.count">10</str>
        <str name="suggest.dictionary">default</str>        
        <str name="echoParams">explicit</str>
      </lst>
      <arr name="components">
         <str>suggest</str>
      </arr>
    </requestHandler>

The results of this approach are quite good, basically as specified, except for some duplicate phrases: some keywords appear twice because they carry whitespace at the beginning or end, like "power cable" and "power cable ".

searchterm Alex
results Alexander Behling, Alexander Someone ...

searchterm cab
results cable, high voltage cable, cable cutter

Indexing easily takes a day for fewer than 10,000 documents. The main problem, though, is that categories are not supported with HighFrequencyDictionaryFactory.
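
The duplicates with trailing whitespace might be avoidable in the analysis chain. The following is only a sketch, under the unverified assumption that the stray whitespace comes from shingles built across removed stop words (fillerToken="" leaves the separator at the edge of the shingle). In that case a TrimFilterFactory placed before the ShingleFilterFactory cannot remove it, so the sketch moves the trim step to after shingling. The field type is otherwise the text_shingle definition from above.

```xml
<fieldType name="text_shingle" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
       <charFilter class="solr.HTMLStripCharFilterFactory"/>
       <tokenizer class="solr.StandardTokenizerFactory"/>
       <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_suggestions.txt" format="snowball" />
       <filter class="solr.ShingleFilterFactory" minShingleSize="2" maxShingleSize="4" outputUnigrams="false" outputUnigramsIfNoShingles="true" fillerToken=""/>
       <!-- Assumption: trimming AFTER shingling, so that "power cable " and
            "power cable" end up as the same indexed term -->
       <filter class="solr.TrimFilterFactory"/>
    </analyzer>
</fieldType>
```

A change like this only takes effect after reindexing and rebuilding the suggester.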

Query

The query I use looks like this:

http://localhost:8983/solr/magnolia/suggest?wt=json&suggest=true&suggest.q=cab

For categories, I add <str name="contextField">site</str> to the suggester config and append &suggest.cfq=com to the query when applicable.
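
As an illustration of the category filter, the context query could also be pinned on the server side instead of being appended by each client. This is a sketch, not a tested setup: the handler name /suggest_com is hypothetical, and it assumes the "default" suggester already declares contextField site with a lookup and dictionary that support it.

```xml
<!-- Hypothetical per-domain handler: always filters suggestions to site "com" -->
<requestHandler name="/suggest_com" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="suggest">true</str>
    <str name="suggest.count">10</str>
    <str name="suggest.dictionary">default</str>
  </lst>
  <!-- invariants cannot be overridden by request parameters -->
  <lst name="invariants">
    <str name="suggest.cfq">com</str>
  </lst>
  <arr name="components">
    <str>suggest</str>
  </arr>
</requestHandler>
```

With such a handler the request would only need suggest.q, e.g. http://localhost:8983/solr/magnolia/suggest_com?wt=json&suggest.q=cab.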

Marc Johnen asked Jun 02 '21 20:06



1 Answer

I ended up using the FreeTextLookupFactory and created a separate field and suggester for each language.
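
A setup along these lines could look roughly like the sketch below. The languages, the suggester names (suggest_en, suggest_de) and the source fields (content_en, content_de) are assumptions for illustration only; the remaining parameters are taken from the FreeTextLookupFactory config in the question.

```xml
<searchComponent name="suggest" class="solr.SuggestComponent">
  <!-- One suggester per language; names and fields are hypothetical -->
  <lst name="suggester">
    <str name="name">suggest_en</str>
    <str name="lookupImpl">FreeTextLookupFactory</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <str name="field">content_en</str>
    <str name="ngrams">3</str>
    <str name="separator"> </str>
    <str name="suggestFreeTextAnalyzerFieldType">text_general</str>
  </lst>
  <lst name="suggester">
    <str name="name">suggest_de</str>
    <str name="lookupImpl">FreeTextLookupFactory</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <str name="field">content_de</str>
    <str name="ngrams">3</str>
    <str name="separator"> </str>
    <str name="suggestFreeTextAnalyzerFieldType">text_general</str>
  </lst>
</searchComponent>
```

The suggester is then chosen per request with suggest.dictionary, e.g. http://localhost:8983/solr/magnolia/suggest?wt=json&suggest=true&suggest.dictionary=suggest_de&suggest.q=cab.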

Marc Johnen answered Oct 19 '22 20:10