I allow users to type Russian words in Latin letters. If a user misspells a Russian word in Latin letters, I want the Solr spellchecker to suggest the correct word in Cyrillic (Russian words in the index are in Cyrillic). However, if the user misspells a non-Russian word (for example, a brand name), it should be corrected in Latin letters (non-Russian words in the index are in Latin).
For example, tilevizor smasung should be fixed to телевизор samsung.
Now I'm using the following configuration:
<fieldType name="spell_ru" class="solr.TextField" positionIncrementGap="100" omitNorms="true">
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ICUTransformFilterFactory" id="Any-Cyrillic; NFD; [^\p{Alnum}] Remove" />
  </analyzer>
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.LengthFilterFactory" min="3" max="256" />
  </analyzer>
</fieldType>
It converts the query to Cyrillic letters, so correction of Russian words works, but correction of Latin words doesn't (tilevizor to телевизор works, but smasung to samsung doesn't).
Any ideas on how I can make the spellchecker correct both Cyrillic and Latin words?
I think a solution that could help here is Beider-Morse Phonetic Matching (BMPM):
Beider-Morse Phonetic Matching (BMPM) is a "soundalike" tool that lets you search using a new phonetic matching system.
So, for example, the words 'tilevizor' and 'телевизор' will sound alike and we will get a match. One thing that could be tuned is the phonetic matching algorithm. Solr's PhoneticFilterFactory supports many of them, and I'm not sure which one will perform best: DoubleMetaphone, Metaphone, Soundex, RefinedSoundex, Caverphone (v2.0), ColognePhonetic, or Nysiis.
Also, I would change solr.ICUTransformFilterFactory to use id="Russian-Latin/BGN", which does a much better job of converting Russian characters to Latin ones.
<fieldType name="spell_ru" class="solr.TextField" positionIncrementGap="100" omitNorms="true">
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ICUTransformFilterFactory" id="Russian-Latin/BGN"/>
    <filter class="solr.PhoneticFilterFactory" encoder="Caverphone"/>
  </analyzer>
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ICUTransformFilterFactory" id="Russian-Latin/BGN"/>
    <filter class="solr.PhoneticFilterFactory" encoder="Caverphone"/>
  </analyzer>
</fieldType>
The fieldType above does the trick in a lot of cases, e.g.:
q=title:tilevizor
SolrDocument{title=телевизор samsung, _version_=1583123812650582016}
SolrDocument{title=televizor самсунг, _version_=1583123812667359232}
q=title:тилевизор
SolrDocument{title=телевизор samsung, _version_=1583123812650582016}
SolrDocument{title=televizor самсунг, _version_=1583123812667359232}
q=title:smasung
SolrDocument{title=телевизор samsung, _version_=1583123812650582016}
SolrDocument{title=televizor самсунг, _version_=1583123812667359232}
SolrDocument{title=гэлакси samsung, _version_=1583123812684136448}
SolrDocument{title=galaxy самсунг, _version_=1583123812684136449}
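To see why these queries match, the analysis chain can be approximated outside Solr. The following is a simplified Python sketch, not Solr's implementation: the transliteration table is a hand-written approximation of BGN romanization rather than the ICU transform, classic Soundex (one of the encoders listed above) stands in for Caverphone because it is short enough to show in full, and the names transliterate, soundex and phonetic_key are hypothetical helpers introduced only for this illustration.

```python
# Hand-written approximation of Russian-to-Latin romanization (assumption:
# this is NOT the exact output of ICU's "Russian-Latin/BGN" transform).
TRANSLIT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e", "ё": "e",
    "ж": "zh", "з": "z", "и": "i", "й": "y", "к": "k", "л": "l", "м": "m",
    "н": "n", "о": "o", "п": "p", "р": "r", "с": "s", "т": "t", "у": "u",
    "ф": "f", "х": "kh", "ц": "ts", "ч": "ch", "ш": "sh", "щ": "shch",
    "ъ": "", "ы": "y", "ь": "", "э": "e", "ю": "yu", "я": "ya",
}

# Classic American Soundex digit groups.
SOUNDEX_CODES = {}
for group, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                     ("l", "4"), ("mn", "5"), ("r", "6")]:
    for letter in group:
        SOUNDEX_CODES[letter] = digit


def transliterate(word):
    """Lowercase the word and replace Cyrillic letters with Latin ones."""
    return "".join(TRANSLIT.get(ch, ch) for ch in word.lower())


def soundex(word):
    """Return the 4-character Soundex key of a Latin-script word."""
    letters = [c for c in word.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    key = letters[0].upper()
    prev = SOUNDEX_CODES.get(letters[0], "")
    for c in letters[1:]:
        code = SOUNDEX_CODES.get(c, "")
        if code and code != prev:
            key += code
        if c not in "hw":  # h and w do not separate equal codes; vowels do
            prev = code
    return (key + "000")[:4]


def phonetic_key(word):
    """Mimic the chain: lowercase -> transliterate -> phonetic encode."""
    return soundex(transliterate(word))


if __name__ == "__main__":
    for w in ["tilevizor", "телевизор", "тилевизор",
              "smasung", "samsung", "самсунг"]:
        print(w, phonetic_key(w))
```

Because the same chain runs at index time and at query time, a misspelled Latin query and the correctly spelled Cyrillic title end up with the same key, which is what produces the hits listed above. The trade-off is that phonetic keys are coarse, so unrelated words can also collide.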
I've created a test class here; feel free to play with it.