
Use multiple stemming languages with ElasticSearch

I'm building a search engine for a website where users can be from many different countries and post text content.

I'll assume that:
- A French user generates content in French and English
- A German user generates content in German and English
- etc.

What I'd like to know is whether it is possible to run a search using several Snowball stemmer languages at the same time, so that we get appropriate results for all of them at once.

Do we have to create one index per Snowball stemmer language?

Is there a known pattern for such a case?

Thanks

Sebastien Lorber asked Jun 14 '12 22:06



2 Answers

A quick disclaimer: I'm not an expert in stemming or language morphology, but since no one else is responding, here's my understanding. Also, most of my experience is with Solr.

To query with stemming against multiple languages and get a single, mixed result set, you need to use a multilingual stemmer. I'm not sure what is available for Elasticsearch.

If you apply multiple single-language stemmers to a single index, they will step on each other's toes and likely not produce the expected results (stemming rules vary significantly from one language to another).

Having one index per language, each with its respective stemmer, works for queries with single-language results. Combining results from multiple queries against multiple indices is usually fairly problematic (you have to attempt to normalize relevancy and deal with paging).
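For illustration, here is a rough sketch of what the index-per-language setup could look like, written as Python that only builds the create-index request bodies (it does not contact a cluster). The `snowball` token filter and its `language` parameter are Elasticsearch's; the index names, the analyzer and filter names, and the `text` field are hypothetical, and the typeless `mappings` layout assumes a recent Elasticsearch version.

```python
import json

# Hypothetical index names mapped to Snowball language names.
LANGUAGES = {
    "content_fr": "French",
    "content_de": "German",
    "content_en": "English",
}


def index_body(snowball_language):
    """Build a create-index body whose 'text' field is stemmed for one language."""
    return {
        "settings": {
            "analysis": {
                "filter": {
                    # Snowball stemmer configured for a single language.
                    "lang_stemmer": {
                        "type": "snowball",
                        "language": snowball_language,
                    }
                },
                "analyzer": {
                    # Custom analyzer: tokenize, lowercase, then stem.
                    "lang_text": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase", "lang_stemmer"],
                    }
                },
            }
        },
        "mappings": {
            "properties": {
                # Hypothetical field name; assumes typeless mappings.
                "text": {"type": "text", "analyzer": "lang_text"}
            }
        },
    }


# One request body per language-specific index.
bodies = {name: index_body(lang) for name, lang in LANGUAGES.items()}

if __name__ == "__main__":
    for name, body in bodies.items():
        print("PUT /" + name)
        print(json.dumps(body, indent=2))
```

Each body would be sent when creating the corresponding index, so that a document written in French is indexed into `content_fr` and stemmed with French rules only.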

Kenneth Ito answered Oct 04 '22 19:10


You can create two separate indices and search both (or all) of them at the same time. As long as the indices use the same field names, you will get valid results.
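As a minimal sketch of searching several indices at once: Elasticsearch accepts a comma-separated list of index names in the search path. The Python below only builds the request path and body, without sending anything; the index names, the `text` field, and the helper name `multi_index_search` are assumptions made for the example.

```python
import json


def multi_index_search(indices, field, text):
    """Return the (path, body) of a search that spans several indices."""
    # Comma-separated index names target all of them in one request.
    path = "/" + ",".join(indices) + "/_search"
    # The same field name must exist in every index being searched.
    body = {"query": {"match": {field: text}}}
    return path, body


# Hypothetical index names and field.
path, body = multi_index_search(["content_fr", "content_en"], "text", "chevaux")

if __name__ == "__main__":
    print("GET " + path)
    print(json.dumps(body, indent=2))
```

Each index analyzes the query text with its own analyzer, so the same query string is stemmed with French rules in one index and English rules in the other.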

farid answered Oct 04 '22 19:10