Ignoring specific characters with Elasticsearch asciifolding

Question

In my analyzer, I have added the asciifolding filter. In most cases this works very well, but when working with the Danish language, I would like to not normalize the æ, ø and å characters, since "rød" and "rod" are very different words.

We are using the hosted Elastic Cloud cluster, so if possible I would like a solution that does not require any non-standard deployments through the cloud platform.

Is there any way to do asciifolding, but whitelist certain characters?

We are currently running Elasticsearch 6.8.

Silas Hansen · Accepted Answer

You should probably use the ICU Folding Token Filter (icu_folding), which is part of the analysis-icu plugin.

From the documentation:

Case folding of Unicode characters based on UTR#30, like the ASCII-folding token filter on steroids.

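It lets you do everything that the asciifolding filter does, and in addition it lets you exclude a set of characters from folding through the unicodeSetFilter property. The value is an ICU UnicodeSet: only characters inside the set are folded, so a negated set ([^...]) leaves the listed characters untouched.

In this case, you want to ignore æ, ø, å, Æ, Ø and Å: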

"unicodeSetFilter": "[^æøåÆØÅ]"

Complete example:

PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "danish_analyzer": {
            "tokenizer": "icu_tokenizer",
            "filter": [
              "danish_folding",
              "lowercase"
            ]
          }
        },
        "filter": {
          "danish_folding": {
            "type": "icu_folding",
            "unicodeSetFilter": "[^æøåÆØÅ]"
          }
        }
      }
    }
  }
}
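
To check the result, you can run the analyzer against some sample text with the _analyze API. This is only a sketch: it assumes the icu_sample index above was created successfully and that the analysis-icu plugin is enabled on the cluster. The sample text is just an illustration.

```
GET icu_sample/_analyze
{
  "analyzer": "danish_analyzer",
  "text": "rød rod café"
}
```

If the filter works as intended, "rød" should come back unchanged while "café" is folded to "cafe". Note that characters excluded through unicodeSetFilter also skip the case folding that icu_folding would normally apply, which is why the lowercase filter is kept in the analyzer.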

Tags:

elasticsearch

elastic-cloud

mortenbock
