 

Elasticsearch search for Turkish characters

I have some documents that I am indexing with Elasticsearch. Some of them are written in upper case with the Turkish characters replaced by their ASCII equivalents. For example, "kürşat" is written as "KURSAT".

I want to find this document by searching for "kürşat". How can I do that?

Thanks

Kursat Serolar asked Jan 05 '23 11:01


1 Answer

Take a look at the asciifolding token filter.

Here is a small example for you to try out in Sense:

Index:

DELETE test
PUT test
{
  "settings": {
    "analysis": {
      "filter": {
        "my_ascii_folding": {
          "type": "asciifolding",
          "preserve_original": true
        }
      },
      "analyzer": {
        "turkish_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_ascii_folding"
          ]
        }
      }
    }
  },
  "mappings": {
    "test": {
      "properties": {
        "name": {
          "type": "string",
          "analyzer": "turkish_analyzer"
        }
      }
    }
  }
}

POST test/test/1
{
  "name": "kürşat"
}

POST test/test/2
{
  "name": "KURSAT"
}

Query:

GET test/_search
{
  "query": {
    "match": {
      "name": "kursat"
    }
  }
}

Response:

 "hits": {
    "total": 2,
    "max_score": 0.30685282,
    "hits": [
      {
        "_index": "test",
        "_type": "test",
        "_id": "2",
        "_score": 0.30685282,
        "_source": {
          "name": "KURSAT"
        }
      },
      {
        "_index": "test",
        "_type": "test",
        "_id": "1",
        "_score": 0.30685282,
        "_source": {
          "name": "kürşat"
        }
      }
    ]
  }

Query:

GET test/_search
{
  "query": {
    "match": {
      "name": "kürşat"
    }
  }
}

Response:

 "hits": {
    "total": 2,
    "max_score": 0.4339554,
    "hits": [
      {
        "_index": "test",
        "_type": "test",
        "_id": "1",
        "_score": 0.4339554,
        "_source": {
          "name": "kürşat"
        }
      },
      {
        "_index": "test",
        "_type": "test",
        "_id": "2",
        "_score": 0.09001608,
        "_source": {
          "name": "KURSAT"
        }
      }
    ]
  }

With 'preserve_original' set to true, the filter keeps the original token alongside its ASCII-folded form. If a user types 'kürşat', documents containing that exact spelling are therefore ranked higher than documents that only contain 'kursat' (compare the scores in the two query responses above).

If you want the scores to be equal, set the flag to false.
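
To make the effect of the flag easier to see, here is a rough Python sketch of the tokens such an analyzer might produce. It is only an approximation and not the Elasticsearch implementation: `fold_ascii`, `analyze` and `EXTRA_FOLDS` are hypothetical names, whitespace splitting stands in for the standard tokenizer, and the folding covers only characters that decompose under Unicode NFKD plus the Turkish dotless i.

```python
import unicodedata

# Characters without a Unicode decomposition need an explicit mapping.
# Assumption: only the Turkish dotless i is handled here; the real
# asciifolding filter covers far more characters.
EXTRA_FOLDS = str.maketrans({"ı": "i"})


def fold_ascii(token: str) -> str:
    """Decompose the token, drop combining marks, then apply extra folds."""
    decomposed = unicodedata.normalize("NFKD", token)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.translate(EXTRA_FOLDS)


def analyze(text: str, preserve_original: bool = True) -> list:
    """Approximate: whitespace tokenizer -> lowercase -> ASCII folding."""
    tokens = []
    for word in text.split():
        lowered = word.lower()
        folded = fold_ascii(lowered)
        tokens.append(folded)
        # Keep the unfolded token as well, but only when folding changed it.
        if preserve_original and folded != lowered:
            tokens.append(lowered)
    return tokens


print(analyze("kürşat"))                           # ['kursat', 'kürşat']
print(analyze("KURSAT"))                           # ['kursat']
print(analyze("kürşat", preserve_original=False))  # ['kursat']
```

Under these assumptions, a search for 'kürşat' carries two terms. Document 1 contains both, while document 2 contains only 'kursat', which is consistent with the score gap in the second response. With the flag set to false, both documents and the query reduce to the single term 'kursat'.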

Hope I got your problem right!

Byron Voorbach answered Jan 09 '23 15:01