What I want to achieve is the ability for people to search for individuals without being language-aware, while not punishing those who are. What I mean is:
Given I build an index containing names such as:
Jorgensen
Jörgensen
Jørgensen
Joergensen
I want to be able to allow conversions such as:
ö => o
ö => oe
ø => o
ø => oe
so that a search for any one of those spellings returns all of the matching records (I include only IDs in my examples, but it would be full records in reality).
Starting with that, I tried to create an index analyzer and char filter like this:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": [
            "my_char_filter"
          ]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "mapping",
          "mappings": [
            "ö => o",
            "ö => oe"
          ]
        }
      }
    }
  }
}
But that is invalid, because it maps the same source character (ö) twice.
What am I missing? Do I need multiple analyzers? Any direction would be appreciated.
Since a custom mapping isn't enough in your case, as the comments above show, let's play with your data and character normalization.
In your case, normalization that just strips accents (with unicodedata) isn't enough, because of the ø and oe conversions. Example:
import unicodedata

def strip_accents(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )
body_matches = [
    u'Jorgensen',
    u'Jörgensen',
    u'Jørgensen',
    u'Joergensen',
]
for b in body_matches:
    print(b, strip_accents(b))
>>>> Jorgensen Jorgensen
>>>> Jörgensen Jorgensen
>>>> Jørgensen Jørgensen
>>>> Joergensen Joergensen
So, we need a custom translation. For now I've only included the characters you showed, but feel free to complete the list.
accented_letters = {
    u'ö': [u'o', u'oe'],
    u'ø': [u'o', u'oe'],
}
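The similar-match queries below call a normalize_word helper that isn't defined in the snippets. Here is a minimal sketch of what it could look like, together with a hypothetical expand_variants helper used at index time; both names and their exact behavior are my assumptions, not part of the original answer:

import unicodedata

def normalize_word(word):
    # Hypothetical helper: lower-case, replace each accented letter with
    # its first plain variant (u'Jörgensen' -> u'jorgensen'), then strip
    # any remaining combining marks so letters like é or ü fold away too.
    word = word.lower()
    for accented, variants in accented_letters.items():
        word = word.replace(accented, variants[0])
    return ''.join(
        c for c in unicodedata.normalize('NFD', word)
        if unicodedata.category(c) != 'Mn'
    )

def expand_variants(word):
    # Hypothetical helper: produce every normalized spelling of a word,
    # e.g. u'Jörgensen' -> ['joergensen', 'jorgensen'], for indexing.
    words = {word.lower()}
    for accented, variants in accented_letters.items():
        words = {w.replace(accented, v) for w in words for v in variants}
    return sorted({normalize_word(w) for w in words})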
Then, we can normalize words and store them in a special property, body_normalized for instance, and index them as a field of your Elasticsearch records.
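As an illustration, indexing could look like the sketch below. The local client plus the index name your_index and type your_type simply mirror the search snippets that follow, and expand_variants is the hypothetical helper sketched above:

from elasticsearch import Elasticsearch

es = Elasticsearch()  # assumes a cluster reachable on localhost:9200

names = [u'Jorgensen', u'Jörgensen', u'Jørgensen', u'Joergensen']
for name in names:
    doc = {
        'body': name,
        # store every normalized spelling, space-separated, so a match
        # query on body_normalized hits any single variant
        'body_normalized': u' '.join(expand_variants(name)),
    }
    # doc_type mirrors the search snippets below; drop it on Elasticsearch 7+
    es.index(index='your_index', doc_type='your_type', body=doc)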
Once they are inserted, you could perform two types of search:
- an exact match against the body field, which isn't normalized
- a similar match against the body_normalized field
Let's see an example:
body_matches = [
    u'Jorgensen',
    u'Jörgensen',
    u'Jørgensen',
    u'Joergensen',
]

print("------EXACT MATCH------")
for body_match in body_matches:
    elasticsearch_query = {
        "query": {
            "match": {
                "body": body_match
            }
        }
    }
    es_kwargs = {
        "doc_type": "your_type",
        "index": 'your_index',
        "body": elasticsearch_query
    }
    res = es.search(**es_kwargs)
    print(body_match, " MATCHING BODIES=", res['hits']['total'])
    for r in res['hits']['hits']:
        print("-", r['_source'].get('body', ''))

print("\n------SIMILAR MATCHES------")
for body_match in body_matches:
    body_match = normalize_word(body_match)
    elasticsearch_query = {
        "query": {
            "match": {
                "body_normalized": body_match
            }
        }
    }
    es_kwargs = {
        "doc_type": "your_type",
        "index": 'your_index',
        "body": elasticsearch_query
    }
    res = es.search(**es_kwargs)
    print(body_match, " MATCHING NORMALIZED BODIES=", res['hits']['total'])
    for r in res['hits']['hits']:
        print("-", r['_source'].get('body', ''))
You can see a running example in this notebook.