
Solr russian spellcheck

I am using Solr spellcheck for Russian. Everything works when the query is typed in Cyrillic characters, but it does not work when the query is typed in Latin characters.

I want spellcheck to correct the query both when it is typed in Cyrillic characters and when it is typed in Latin characters, and the correction should always be returned in Cyrillic.

For example, when you type:

телевидениеее or televidenieee

It should correct to:

телевидение

schema.xml:

<fieldType name="spell_text" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
        <charFilter class="solr.HTMLStripCharFilterFactory"/>
        <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[,.;:]" replacement=" "/>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.PatternReplaceFilterFactory" pattern="'s" replacement=""/>
        <filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="true"/>
        <filter class="solr.LengthFilterFactory" min="3" max="256" />
    </analyzer>
</fieldType>

solrconfig.xml:

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
    <lst name="spellchecker">
        <str name="name">default</str>
        <str name="field">spellcheck</str>
        <str name="classname">solr.IndexBasedSpellChecker</str>
        <str name="buildOnCommit">true</str>
        <str name="buildOnOptimize">true</str>
        <str name="spellcheckIndexDir">./spellchecker</str>
        <str name="accuracy">0.75</str>
    </lst>
    <lst name="spellchecker">
        <str name="name">wordbreak</str>
        <str name="field">spellcheck</str>
        <str name="classname">solr.WordBreakSolrSpellChecker</str>
        <str name="combineWords">false</str>
        <str name="breakWords">true</str>
        <int name="maxChanges">1</int>
    </lst>
</searchComponent>

Thanks for any help.

KiraLT asked Oct 31 '13 19:10


1 Answer

This can be achieved with ICUTransformFilterFactory, which will transliterate Latin-script input back into Cyrillic each time a query is analyzed.

Here is an example of how to enable this functionality:

  1. Enable the ICU analyzers (lucene-analyzers-icu-*.jar, icu4j-*.jar):

    These libraries can be found in the contrib/analysis-extras folder of the Solr distribution from the official site (they are also available via Maven).

    In solrconfig.xml, add something like the following to load them. You can also use a single lib dir that contains all the jars you need; this example simply uses the default locations relative to the example/solr/collection1/conf folder of the official distribution:

    <lib dir="../../../contrib/analysis-extras/lib" regex=".*\.jar" />
    <lib dir="../../../contrib/analysis-extras/lucene-libs" regex=".*\.jar" />
    
  2. Split the spell_text analyzer into two separate analyzers, one for index and one for query.

  3. Add solr.ICUTransformFilterFactory to the query analyzer with the following id: Any-Cyrillic; NFD; [^\p{Alnum}] Remove

    <fieldType name="spell_text" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <charFilter class="solr.HTMLStripCharFilterFactory"/>
        <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[,.;:]" replacement=" "/>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.PatternReplaceFilterFactory" pattern="'s" replacement=""/>
        <filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="true"/>
        <filter class="solr.LengthFilterFactory" min="3" max="256" />
      </analyzer>
      <analyzer type="query">
        <charFilter class="solr.HTMLStripCharFilterFactory"/>
        <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[,.;:]" replacement=" "/>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.PatternReplaceFilterFactory" pattern="'s" replacement=""/>
        <filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="true"/>
        <filter class="solr.LengthFilterFactory" min="3" max="256" />
    
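        <!-- Query side only: transliterate input into Cyrillic, decompose it (NFD), then remove non-alphanumeric characters -->
        <filter class="solr.ICUTransformFilterFactory" id="Any-Cyrillic; NFD; [^\p{Alnum}] Remove" />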
      </analyzer>
    </fieldType>
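
    The transform only helps if the text handed to the spellchecker actually passes through the spell_text query analyzer. A minimal sketch of one way to make that explicit, using the spellcheck component's queryAnalyzerFieldType setting. It assumes the spellcheck field from the question is declared with type spell_text, which the question does not show, and the setting may be unnecessary depending on your existing setup:

```xml
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
    <!-- Assumption: analyze spellcheck input with the spell_text query analyzer,
         so Latin input is transliterated before the dictionary lookup -->
    <str name="queryAnalyzerFieldType">spell_text</str>
    <lst name="spellchecker">
        <str name="name">default</str>
        <str name="field">spellcheck</str>
        <str name="classname">solr.IndexBasedSpellChecker</str>
        <!-- remaining settings unchanged from the question -->
    </lst>
</searchComponent>
```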
    

Regarding the ICUTransformFilterFactory id (Any-Cyrillic; NFD; [^\p{Alnum}] Remove), see:

  • Related Stack Overflow question
  • Official guide

The configuration described above works on my local machine the same way for Russian transliterations and for Russian words.
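
To try this out, the spellcheck component has to be attached to a request handler. The following is only a sketch: the handler name /spell and the default values are assumptions and are not part of the original configuration, while the dictionary names match the ones defined in the question:

```xml
<requestHandler name="/spell" class="solr.SearchHandler">
    <lst name="defaults">
        <!-- Turn spellchecking on and query both dictionaries from the question -->
        <str name="spellcheck">true</str>
        <str name="spellcheck.dictionary">default</str>
        <str name="spellcheck.dictionary">wordbreak</str>
        <!-- Hypothetical value: maximum number of suggestions per term -->
        <str name="spellcheck.count">5</str>
    </lst>
    <arr name="last-components">
        <str>spellcheck</str>
    </arr>
</requestHandler>
```

With such a handler in place, requests with q=телевидениеее and q=televidenieee can be compared directly. Rebuild the spellcheck index (for example with spellcheck.build=true) after changing the analyzers.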

rchukh answered Sep 28 '22 05:09