
Ignoring specific characters with Elasticsearch asciifolding

In my analyzer, I have added the asciifolding filter. In most cases this works very well, but when working with Danish text, I would like to leave the characters æ, ø, and å unnormalized, since "rød" and "rod" are very different words.

We are using a hosted Elastic Cloud cluster, so if possible we would prefer a solution that does not require any non-standard deployments through the cloud platform.

Is there any way to do asciifolding, but whitelist certain characters?

We are currently running Elasticsearch 6.8.

mortenbock asked Oct 16 '25 20:10


1 Answer

You should probably be using the ICU Folding Token Filter (icu_folding), which is provided by the analysis-icu plugin.

From the documentation:

"Case folding of Unicode characters based on UTR#30, like the ASCII-folding token filter on steroids."

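It lets you do everything that the asciifolding filter does, and it also lets you exempt a set of characters from folding through the unicodeSetFilter property.

In this case, you want to exempt æ, ø, å, Æ, Ø, and Å. The property takes a UnicodeSet naming the characters that may be folded, so the exempt characters go in a negated set: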

"unicodeSetFilter": "[^æøåÆØÅ]"

Complete example:

PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "danish_analyzer": {
            "tokenizer": "icu_tokenizer",
            "filter": [
              "danish_folding",
              "lowercase"
            ]
          }
        },
        "filter": {
          "danish_folding": {
            "type": "icu_folding",
            "unicodeSetFilter": "[^æøåÆØÅ]"
          }
        }
      }
    }
  }
}
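To check the analyzer, you can run a sample string through the _analyze API. This is a minimal sketch that assumes the icu_sample index above was created successfully; the sample text "Rød Café" is only an illustration.

```
GET icu_sample/_analyze
{
  "analyzer": "danish_analyzer",
  "text": "Rød Café"
}
```

If the filter behaves as intended, the returned tokens should be "rød" and "cafe": the é is folded, while the ø is left alone. The lowercase filter after danish_folding matters because the exempt uppercase characters (Æ, Ø, Å) skip the case folding that icu_folding would otherwise apply. Newer Elasticsearch versions document the parameter as unicode_set_filter, so check the documentation for your version if you upgrade from 6.8.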
Silas Hansen answered Oct 18 '25 22:10