 

How to make Solr spellchecker to correct both Latin and Cyrillic words?

Tags:

solr

I allow users to type Russian words in Latin letters. If a user misspells a Russian word in Latin letters, I want the Solr spellchecker to suggest the correct word in Cyrillic (Russian words in the index are in Cyrillic). However, if the user misspells a non-Russian word (for example, a brand name), it should be corrected in Latin letters (non-Russian words in the index are in Latin).

For example, tilevizor smasung should be fixed to телевизор samsung

Now I'm using the following configuration:

<fieldType name="spell_ru" class="solr.TextField" positionIncrementGap="100" omitNorms="true">
    <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ICUTransformFilterFactory" id="Any-Cyrillic; NFD; [^\p{Alnum}] Remove" />
    </analyzer>
    <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.LengthFilterFactory" min="3" max="256" />
    </analyzer>
</fieldType>

It converts the query to Cyrillic letters, so correction of Russian words works, but correction of Latin words doesn't (tilevizor to телевизор works, but smasung to samsung doesn't).

Any ideas on how I can make the spellchecker correct both Cyrillic and Latin words?

Rinas asked Dec 03 '13 12:12


1 Answer

I think a solution that could help here is Beider-Morse Phonetic Matching (BMPM).

Beider-Morse Phonetic Matching (BMPM) is a "soundalike" tool that lets you search using a new phonetic matching system.

So, for example, the words 'tilevizor' and 'телевизор' will sound alike and we will get a match. One thing that can be tuned is the phonetic matching algorithm. Solr supports many of them, and I'm not sure which one will perform best: DoubleMetaphone, Metaphone, Soundex, RefinedSoundex, Caverphone (v2.0), ColognePhonetic, or Nysiis.

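Also, I would change the id of solr.ICUTransformFilterFactory to "Russian-Latin/BGN", which does a much better job of converting Russian characters to Latin ones. In the configuration below it is applied at both index and query time, so Cyrillic and Latin spellings are compared in the same Latin form.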

    <fieldType name="spell_ru" class="solr.TextField" positionIncrementGap="100" omitNorms="true">
        <analyzer type="query">
            <tokenizer class="solr.StandardTokenizerFactory"/>
            <filter class="solr.LowerCaseFilterFactory"/>
            <filter class="solr.ICUTransformFilterFactory" id="Russian-Latin/BGN"/>
            <filter class="solr.PhoneticFilterFactory" encoder="Caverphone"/>
        </analyzer>
        <analyzer type="index">
            <tokenizer class="solr.StandardTokenizerFactory"/>
            <filter class="solr.LowerCaseFilterFactory"/>
            <filter class="solr.ICUTransformFilterFactory" id="Russian-Latin/BGN"/>
            <filter class="solr.PhoneticFilterFactory" encoder="Caverphone"/>
        </analyzer>
    </fieldType>
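
For the sample queries further down to go through this analysis chain, the searched field has to be declared with this type. A minimal sketch, assuming the field is named title as in those queries (the indexed/stored values are only illustrative):

```xml
<!-- Assumption: field name "title" matches the sample queries;
     indexed/stored settings are illustrative, adjust to your schema. -->
<field name="title" type="spell_ru" indexed="true" stored="true"/>
```

Documents need to be reindexed after changing the index-time analyzer, otherwise old terms stay in the index in their previous form.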

The fieldType above does the trick in many cases, e.g.:

q=title:tilevizor
SolrDocument{title=телевизор samsung, _version_=1583123812650582016}
SolrDocument{title=televizor самсунг, _version_=1583123812667359232}

q=title:тилевизор
SolrDocument{title=телевизор samsung, _version_=1583123812650582016}
SolrDocument{title=televizor самсунг, _version_=1583123812667359232}

q=title:smasung
SolrDocument{title=телевизор samsung, _version_=1583123812650582016}
SolrDocument{title=televizor самсунг, _version_=1583123812667359232}
SolrDocument{title=гэлакси samsung, _version_=1583123812684136448}
SolrDocument{title=galaxy самсунг, _version_=1583123812684136449}
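
Note that the fieldType above uses solr.PhoneticFilterFactory with the Caverphone encoder rather than Beider-Morse itself. If you want to try BMPM directly, Solr ships solr.BeiderMorseFilterFactory. The following is only a sketch: the fieldType name spell_bm is made up, the parameter values are a starting point rather than tuned settings, and I have not run it against the sample documents above. It also assumes the generic BMPM rule set copes with Cyrillic input well enough that the ICU transliteration step can be dropped; if it doesn't, keep the ICUTransformFilterFactory in front of it.

```xml
<!-- Hypothetical fieldType name; parameter values are untuned starting points. -->
<fieldType name="spell_bm" class="solr.TextField" positionIncrementGap="100" omitNorms="true">
    <!-- A single analyzer is used for both index and query time. -->
    <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- GENERIC rules with automatic language guessing;
             APPROX matches more loosely than EXACT. -->
        <filter class="solr.BeiderMorseFilterFactory"
                nameType="GENERIC"
                ruleType="APPROX"
                concat="true"
                languageSet="auto"/>
    </analyzer>
</fieldType>
```

Whether this beats Caverphone for inputs like smasung is something to check on your own data, since BMPM was designed for matching names rather than correcting transposed letters.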

I've created a test class here; feel free to play with it.

Mysterion answered Oct 13 '22 05:10