
ElasticSearch Stemming

I am using ElasticSearch and I want to set up basic stemming for English. So basically, a search for fighter should return fight or any word that shares the fight root.

I am a little confused about how to implement this. I was reading through the analyzers, tokenizers and filters, and there are multiple stemming algorithms that can be used in ElasticSearch. I am just not sure which combination to use: snowball, stemmer, porter stem or synonym filters.

Also, an example of the mapping would be really helpful.

Gabbar asked Jul 11 '12 14:07

1 Answer

Please mind the difference between stemming and lemmatisation. A stemming algorithm applies a series of rules (and/or dictionary lookups, as is the case for KStem, for example) and doesn't guarantee that the result will be a proper linguistic 'root' (i.e. a lemma).

So, for instance, both 'marinate' and 'marines' will be converted to 'marin' by the Porter stemmer, which is considered quite 'aggressive': it tends to produce the same stem for a large number of words. There are more conservative ones, for example the S-Stemmer, which only converts plural forms to singular (org.apache.lucene.analysis.en.EnglishMinimalStemFilter).
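To make the "conservative" end of the spectrum concrete, here is a toy plural stripper in Python, loosely following the three suffix rules usually attributed to the S-Stemmer. It is a simplified sketch for illustration only, not Lucene's EnglishMinimalStemFilter, and the function name `s_stem` is made up:

```python
def s_stem(word):
    """Toy plural-to-singular stemmer (simplified S-Stemmer-style rules)."""
    w = word.lower()
    # Rule 1: "ies" -> "y", unless the word ends in "eies" or "aies".
    if w.endswith("ies") and not w.endswith(("eies", "aies")):
        return w[:-3] + "y"
    # Rule 2: "es" -> "e", unless the word ends in "aes", "ees" or "oes".
    if w.endswith("es") and not w.endswith(("aes", "ees", "oes")):
        return w[:-1]
    # Rule 3: drop a trailing "s", unless the word ends in "us" or "ss".
    if w.endswith("s") and not w.endswith(("us", "ss")):
        return w[:-1]
    return w


if __name__ == "__main__":
    for word in ["marinate", "marines", "fighters", "ponies", "glass"]:
        print(word, "->", s_stem(word))
```

With rules like these, 'marinate' and 'marines' stay distinct, but 'fighter' is not reduced to 'fight' either, which is the trade-off between a conservative and an aggressive stemmer.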

Comparisons of stemming methods in research papers seem to favor KStem as the most effective for English text, but the choice of stemmer depends heavily on the vocabulary of your documents. Your aim is not to optimize the stemmer's performance but the performance of the search engine, so measuring the stemmer in isolation from the other elements of your system (especially query expansion) is not a good idea in practice.
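Search-engine performance for a single query is commonly summarized as precision (the share of returned documents that are relevant) and recall (the share of relevant documents that were returned). A minimal sketch of that calculation, assuming you have hand-labelled the relevant document IDs for a query; the IDs below are made up:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: iterable of document IDs the search engine returned
    relevant:  iterable of document IDs judged relevant by a human
    """
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = len(retrieved_set & relevant_set)
    # Guard against empty sets to avoid division by zero.
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical result list and relevance judgements for one query.
    p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5])
    print(f"precision={p:.2f} recall={r:.2f}")
```

Running the same labelled queries against each candidate stemmer and comparing these two numbers shows whether a stemmer is too aggressive (precision drops) or too conservative (recall drops).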

The best solution is to try several of the stemmers available in Elasticsearch (an example mapping can be seen here) and observe the precision and recall of the results. If you don't have a test suite of queries, your best bet is to run 'typical' queries and watch for 'strange' results (the stemmer being too aggressive) or 'good' results being omitted (the stemmer being too conservative).
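As a starting point for such a comparison, the sketch below builds an index body with one custom analyzer per stemmer and indexes the same field once per analyzer as sub-fields. It assumes the settings/mappings layout of recent Elasticsearch versions (older releases use a different mapping structure), and the field name `body` and the analyzer names are made up for illustration:

```python
import json

# One token filter definition per stemmer to compare.
# The filter types (porter_stem, kstem, snowball, stemmer) are built into
# Elasticsearch; the dictionary keys are arbitrary labels chosen here.
STEMMERS = {
    "porter": {"type": "porter_stem"},
    "kstem": {"type": "kstem"},
    "snowball_en": {"type": "snowball", "language": "English"},
    "minimal_en": {"type": "stemmer", "language": "minimal_english"},
}


def build_index_body(field="body"):
    """Build settings and mappings with one sub-field per stemmer."""
    filters = {f"{name}_filter": spec for name, spec in STEMMERS.items()}
    analyzers = {
        f"{name}_analyzer": {
            "type": "custom",
            "tokenizer": "standard",
            # Lowercase first, then apply the stemmer under test.
            "filter": ["lowercase", f"{name}_filter"],
        }
        for name in STEMMERS
    }
    subfields = {
        name: {"type": "text", "analyzer": f"{name}_analyzer"}
        for name in STEMMERS
    }
    return {
        "settings": {"analysis": {"filter": filters, "analyzer": analyzers}},
        "mappings": {
            "properties": {field: {"type": "text", "fields": subfields}}
        },
    }


if __name__ == "__main__":
    # Print the JSON to send when creating the index.
    print(json.dumps(build_index_body(), indent=2))
```

After creating an index with this body, the same query can be run against `body.porter`, `body.kstem` and so on, and the result lists compared.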

Artur Nowak answered Sep 20 '22 01:09