What I want to achieve is the ability for people to search for individuals without being language-aware, while not punishing those who are. What I mean is:
Given I build an index containing names such as:
Jorgensen
Jörgensen
Jørgensen
Joergensen
I want to be able to allow conversions such as:
ö => o
ö => oe
ø => o
ø => oe
so that a search for any one of those spellings returns all of the matching records (I include only IDs in my examples, but it would be full records in reality).
Starting with that, I tried to create an index analyzer and char filter like this:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": [
            "my_char_filter"
          ]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "mapping",
          "mappings": [
            "ö => o",
            "ö => oe"
          ]
        }
      }
    }
  }
}
But that is invalid, because it maps the same source character (ö) twice.
What am I missing? Do I need multiple analyzers? Any direction would be appreciated.
Since a custom mapping isn't enough in your case, as the comments above show, let's play with your data and character normalization.
In your case, normalization that just strips accents (with unicodedata) isn't enough, because of the ø and oe conversions. Example:
import unicodedata

def strip_accents(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )
body_matches = [
    u'Jorgensen',
    u'Jörgensen',
    u'Jørgensen',
    u'Joergensen',
]
for b in body_matches:
    print(b, strip_accents(b))
>>>> Jorgensen Jorgensen
>>>> Jörgensen Jorgensen
>>>> Jørgensen Jørgensen
>>>> Joergensen Joergensen
So, we need a custom translation. For now I've only included the characters you showed, but feel free to complete the list.
accented_letters = {
    u'ö': [u'o', u'oe'],
    u'ø': [u'o', u'oe'],
}
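The similar-match queries below call a normalize_word helper that isn't defined in the snippets. Here is a minimal sketch of what it could look like, together with a hypothetical expand_variants helper used at index time; both names and their exact behavior are my assumptions, not part of the original answer:

import unicodedata

def normalize_word(word):
    # Hypothetical helper: lower-case, replace each accented letter with
    # its first plain variant (u'Jörgensen' -> u'jorgensen'), then strip
    # any remaining combining marks so letters like é or ü fold away too.
    word = word.lower()
    for accented, variants in accented_letters.items():
        word = word.replace(accented, variants[0])
    return ''.join(
        c for c in unicodedata.normalize('NFD', word)
        if unicodedata.category(c) != 'Mn'
    )

def expand_variants(word):
    # Hypothetical helper: produce every normalized spelling of a word,
    # e.g. u'Jörgensen' -> ['joergensen', 'jorgensen'], for indexing.
    words = {word.lower()}
    for accented, variants in accented_letters.items():
        words = {w.replace(accented, v) for w in words for v in variants}
    return sorted({normalize_word(w) for w in words})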
Then, we can normalize words and store them in a special property, body_normalized for instance, and index them as a field of your Elasticsearch records.
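As an illustration, indexing could look like the sketch below. The local client plus the index name your_index and type your_type simply mirror the search snippets that follow, and expand_variants is the hypothetical helper sketched above:

from elasticsearch import Elasticsearch

es = Elasticsearch()  # assumes a cluster reachable on localhost:9200

names = [u'Jorgensen', u'Jörgensen', u'Jørgensen', u'Joergensen']
for name in names:
    doc = {
        'body': name,
        # store every normalized spelling, space-separated, so a match
        # query on body_normalized hits any single variant
        'body_normalized': u' '.join(expand_variants(name)),
    }
    # doc_type mirrors the search snippets below; drop it on Elasticsearch 7+
    es.index(index='your_index', doc_type='your_type', body=doc)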
Once they are inserted, you could perform two types of search:
- an exact match against the body field, which isn't normalized
- a similar match against the body_normalized field
Let's see an example:
body_matches = [
    u'Jorgensen',
    u'Jörgensen',
    u'Jørgensen',
    u'Joergensen',
]

print("------EXACT MATCH------")
for body_match in body_matches:
    elasticsearch_query = {
        "query": {
            "match": {
                "body": body_match
            }
        }
    }
    es_kwargs = {
        "doc_type": "your_type",
        "index": 'your_index',
        "body": elasticsearch_query
    }
    res = es.search(**es_kwargs)
    print(body_match, " MATCHING BODIES=", res['hits']['total'])
    for r in res['hits']['hits']:
        print("-", r['_source'].get('body', ''))

print("\n------SIMILAR MATCHES------")
for body_match in body_matches:
    body_match = normalize_word(body_match)
    elasticsearch_query = {
        "query": {
            "match": {
                "body_normalized": body_match
            }
        }
    }
    es_kwargs = {
        "doc_type": "your_type",
        "index": 'your_index',
        "body": elasticsearch_query
    }
    res = es.search(**es_kwargs)
    print(body_match, " MATCHING NORMALIZED BODIES=", res['hits']['total'])
    for r in res['hits']['hits']:
        print("-", r['_source'].get('body', ''))
You can see a running example in this notebook.