I'm building a search engine for a website whose users come from many different countries and post text content.
I'll assume that:
- A French user generates content in French and English
- A German user generates content in German and English
- etc.
What I'd like to know is whether it is possible to run a search using several Snowball stemmer languages at the same time, so that a single query returns appropriate results for every language.
Do we have to create one index per Snowball stemmer language?
Is there a known pattern for such a case?
Thanks
We'll support only a finite set of languages (German, English, Korean, Japanese and Chinese) since we need to set up a specific analyzer for each language. Any documents that aren't in one of our supported languages will get indexed in a default field with the standard analyzer.
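A minimal sketch of that layout, assuming one text field per supported language plus a default field. The field names (`body_de`, `body_default`, ...) and the language codes are hypothetical, and the built-in `cjk` analyzer is only a stand-in for Korean, Japanese and Chinese; the snippet above does not say which analyzers were actually used.

```python
# Hypothetical mapping from a detected language code to the name of a
# built-in Elasticsearch analyzer. "cjk" is an assumption used here for
# Korean, Japanese and Chinese; dedicated analysis plugins may fit better.
ANALYZERS = {
    "de": "german",
    "en": "english",
    "ko": "cjk",
    "ja": "cjk",
    "zh": "cjk",
}


def build_mapping():
    """Build an index body with one text field per supported language."""
    # Documents in unsupported languages go to a field with the standard analyzer.
    properties = {"body_default": {"type": "text", "analyzer": "standard"}}
    for lang, analyzer in ANALYZERS.items():
        properties[f"body_{lang}"] = {"type": "text", "analyzer": analyzer}
    return {"mappings": {"properties": properties}}


def to_document(text, lang):
    """Put the text into the field that matches its detected language."""
    field = f"body_{lang}" if lang in ANALYZERS else "body_default"
    return {field: text}
```

Language detection itself is left out; `to_document` assumes the caller already knows the language code of each post.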
In Elasticsearch, stemming is handled by stemmer token filters. These token filters can be categorized based on how they stem words:
- Algorithmic stemmers, which stem words based on a set of rules.
- Dictionary stemmers, which stem words by looking them up in a dictionary.
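As a rough illustration of the algorithmic kind, the sketch below builds index settings that define one custom analyzer per language, each ending in a `stemmer` token filter. The filter and analyzer names (`french_stemmer`, `french_stemmed`, ...) are made up for this example, and the choice of the `standard` tokenizer plus `lowercase` is an assumption rather than a recommendation.

```python
def stemming_analyzer_settings(languages):
    """Return index settings with one stemming analyzer per language.

    Each value in `languages` is passed as the `language` parameter of a
    `stemmer` token filter, e.g. "english", "french" or "german".
    """
    filters = {}
    analyzers = {}
    for lang in languages:
        filter_name = f"{lang}_stemmer"  # hypothetical name
        filters[filter_name] = {"type": "stemmer", "language": lang}
        analyzers[f"{lang}_stemmed"] = {  # hypothetical name
            "type": "custom",
            "tokenizer": "standard",
            "filter": ["lowercase", filter_name],
        }
    return {"settings": {"analysis": {"filter": filters, "analyzer": analyzers}}}
```

Calling `stemming_analyzer_settings(["french", "english"])` yields a body that could be sent when creating an index; each analyzer is then attached to the field that holds content in that language.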
The following types are supported: arabic, armenian, basque, bengali, brazilian, bulgarian, catalan, cjk, czech, danish, dutch, english, estonian, finnish, french, galician, german, greek, hindi, hungarian, indonesian, irish, italian, latvian, lithuanian, norwegian, persian, portuguese, ...
So, quick disclaimer: I'm not an expert in stemming or language morphology, but since no one else is responding, here's my understanding. Also, most of my experience is with Solr.
In order to query with stemming against multiple languages and get a single, mixed result set, you need to use a multilingual stemmer. I'm not sure what is available for Elasticsearch.
If you apply multiple single-language stemmers to the same index, they will step on each other's toes and likely not produce the expected results (stemming rules vary significantly from language to language).
Having an index per language with respective stemmers works for queries with single language results. Trying to combine results from multiple queries against multiple indices is usually fairly problematic (you have to attempt to normalize relevancy and deal with paging).
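To make the normalization and paging problem concrete, here is one naive sketch: divide each score by the top score of its own result set, merge, sort, then slice out a page. This is only an illustration under the assumption that every result set is a list of `(doc_id, score)` pairs; it is not an established recipe, and scores scaled this way are still only loosely comparable.

```python
def merge_results(result_sets, page, page_size):
    """Merge per-index hits into one ranked page.

    `result_sets` is a list of hit lists, one per index; every hit is a
    (doc_id, score) tuple. `page` is zero-based.
    """
    merged = []
    for hits in result_sets:
        if not hits:
            continue
        # Scale scores so the best hit of each index gets 1.0.
        top = max(score for _, score in hits)
        for doc_id, score in hits:
            merged.append((doc_id, score / top if top > 0 else 0.0))
    # Python's sort is stable, so ties keep the order of the input lists.
    merged.sort(key=lambda pair: pair[1], reverse=True)
    start = page * page_size
    return merged[start:start + page_size]
```

Note that serving page N correctly requires fetching at least the first (N + 1) * page_size hits from every index, which is part of why paging gets awkward.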
You can create two separate indices and search both (or all) of them at the same time. As long as the fields of the indices are the same, you will get valid results.
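A small sketch of what such a request might look like: Elasticsearch accepts a comma-separated list of index names in the search path. The index names (`posts_fr`, `posts_en`) and the field name `body` are hypothetical, and the snippet only builds the path and body without sending anything.

```python
def build_search_request(indices, field, text):
    """Return the (path, body) pair for one search over several indices."""
    # Several indices can be targeted at once by joining their names with commas.
    path = "/" + ",".join(indices) + "/_search"
    # The same field name must exist in every index for this query to match.
    body = {"query": {"match": {field: text}}}
    return path, body


path, body = build_search_request(["posts_fr", "posts_en"], "body", "chevaux")
```

Each index analyzes the query text with its own analyzer for that field, so the French index stems it as French and the English index as English.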