
Mapping international character to multiple options

What I want to achieve is the ability for people to search for individuals without being language-aware, while not punishing those who are. What I mean is:

Given I build index:

  1. Jorgensen
  2. Jörgensen
  3. Jørgensen

I want to allow conversions such as:

  1. ö to o
  2. ö to oe
  3. ø to o
  4. ø to oe

so if someone searches, the results should be as follows (I include only the IDs, but they would be full records in reality):

  • Jorgensen returns 1, 2, 3
  • Jörgensen returns 1, 2
  • Jørgensen returns 1, 3
  • Joergensen returns 2, 3

Starting with that, I tried to create an index analyzer and char filter like this:

{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": [
            "my_char_filter"
          ]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "mapping",
          "mappings": [
            "ö => o",
            "ö => oe"
          ]
        }
      }
    }
  }
}

But that is invalid, because it defines two mappings for the same source character.

What am I missing? Do I need multiple analyzers? Any direction would be appreciated.

asked Feb 17 '17 by Shawnas

1 Answer

Since a custom mapping isn't enough in your case, as the comments above show, let's play with your data and character normalization.
In your case, accent-stripping normalization with unicodedata isn't enough, because of the ø and oe conversions. Example:

import unicodedata

def strip_accents(s):
    # NFD-decompose and drop combining marks (category 'Mn');
    # note that ø has no decomposition, so it survives untouched
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )

body_matches = [
    u'Jorgensen',
    u'Jörgensen',
    u'Jørgensen',
    u'Joergensen',
]
for b in body_matches:
    print(b, strip_accents(b))

>>>> Jorgensen Jorgensen
>>>> Jörgensen Jorgensen
>>>> Jørgensen Jørgensen
>>>> Joergensen Joergensen

So, we need a custom translation. For now I've only included the characters you showed, but feel free to complete the list.

accented_letters = {
    u'ö' : [u'o',u'oe'],
    u'ø' : [u'o',u'oe'],
}
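
The queries below use a normalize_word helper that the linked notebook defines but this answer doesn't show; here is a minimal sketch under that assumption, generating every plain-ASCII variant of a name and joining the variants into one string so a match query (which ORs its terms) can hit any of them:

from itertools import product

def normalize_word(word):
    # For each character, use its replacement list from accented_letters,
    # or the character itself when there is no entry
    options = [accented_letters.get(c, [c]) for c in word.lower()]
    # Build every combination of replacements and join the distinct variants
    variants = {''.join(combo) for combo in product(*options)}
    return ' '.join(sorted(variants))

print(normalize_word(u'Jörgensen'))
# joergensen jorgensen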

Then we can normalize words, store them in a special property, body_normalized for instance, and index them as a field of your Elasticsearch records.
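
A minimal indexing sketch, assuming an elasticsearch-py client and reusing the hypothetical your_index / your_type names from the searches below:

from elasticsearch import Elasticsearch

es = Elasticsearch()  # assumes a local cluster; adjust hosts as needed

names = [u'Jorgensen', u'Jörgensen', u'Jørgensen', u'Joergensen']
for name in names:
    es.index(
        index='your_index',
        doc_type='your_type',
        body={
            'body': name,                              # original, non-normalized name
            'body_normalized': normalize_word(name),   # plain-ASCII variants
        }
    )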
Once they are inserted, you could perform two types of search:

  1. Exact search: the user input isn't normalized, and the Elasticsearch query searches against the body field, which isn't normalized either.
  2. Similar search: the user input is normalized, and we search against the body_normalized field.

Let's see an example

body_matches = [
    u'Jorgensen',
    u'Jörgensen',
    u'Jørgensen',
    u'Joergensen',
]

# exact search: query the non-normalized body field with the raw user input
print("------EXACT MATCH------")
for body_match in body_matches:
    elasticsearch_query = {
        "query": {
            "match": {
                "body": body_match
            }
        }
    }
    es_kwargs = {
        "doc_type": "your_type",
        "index": 'your_index',
        "body": elasticsearch_query
    }

    res = es.search(**es_kwargs)
    print(body_match, "MATCHING BODIES=", res['hits']['total'])

    for r in res['hits']['hits']:
        print("-", r['_source'].get('body', ''))

# similar search: normalize the user input and query the body_normalized field
print("\n------SIMILAR MATCHES------")
for body_match in body_matches:
    body_match = normalize_word(body_match)
    elasticsearch_query = {
        "query": {
            "match": {
                "body_normalized": body_match
            }
        }
    }
    es_kwargs = {
        "doc_type": "your_type",
        "index": 'your_index',
        "body": elasticsearch_query
    }

    res = es.search(**es_kwargs)
    print(body_match, "MATCHING NORMALIZED BODIES=", res['hits']['total'])

    for r in res['hits']['hits']:
        print("-", r['_source'].get('body', ''))

You can see a running example in this notebook

answered Nov 11 '22 by xecgr