Are there recommendations for which analyzers and filters to use when indexing and searching human names?
Examples of names that might pose difficulties:
thx Marc
Here's an analyzer and filter to get you started. It's hard to cover every case, but an asciifolding filter handles the François versus Francois case.
In the example below, preserve_original is set to true, so the filter emits both the original token and its ASCII-folded form. A query for either François or Francois then resolves to the same result set.
"analyzer": {
  "name_analyzer": {
    "type": "custom",
    "tokenizer": "standard",
    "filter": [
      "lowercase",
      "trim",
      "my_ascii_folding"
    ]
  }
},
"filter": {
  "my_ascii_folding": {
    "type": "asciifolding",
    "preserve_original": true
  }
}
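For context, the two fragments above belong under settings.analysis in the body of an index-creation request. A minimal sketch of a complete body, assuming a recent Elasticsearch version with typeless mappings (7.x or later); the field name full_name is hypothetical and only there for illustration:

```json
{
  "settings": {
    "analysis": {
      "analyzer": {
        "name_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "trim", "my_ascii_folding"]
        }
      },
      "filter": {
        "my_ascii_folding": {
          "type": "asciifolding",
          "preserve_original": true
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "full_name": {
        "type": "text",
        "analyzer": "name_analyzer"
      }
    }
  }
}
```

Because no separate search_analyzer is set on the field, name_analyzer should apply at both index time and search time, which is what lets the accented and unaccented spellings meet.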
By defining a synonym filter, you can list names that are commonly treated as equivalent in your language (for example, a line like François => Francois in your synonyms file). That will do the trick in the short run.
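As a rough sketch of what such a filter definition might look like, here is a synonym filter with the rule given inline rather than in a file. The filter name name_synonyms is hypothetical; the rule is written in lowercase on the assumption that the filter is listed after lowercase in the analyzer's filter chain:

```json
{
  "filter": {
    "name_synonyms": {
      "type": "synonym",
      "synonyms": [
        "françois => francois"
      ]
    }
  }
}
```

For a longer list, the synonym filter also accepts a synonyms_path setting pointing to a file in the Elasticsearch config directory, with one rule per line in the same format.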
Lastly, a pattern_replace char filter with the pattern "([A-Za-z]+)ae([A-Za-z]+)" and the replacement "$1a$2" can turn every Verhaeven into Verhaven.
Something like...
"char_filter": {
  "ae_char_filter": {
    "type": "pattern_replace",
    "pattern": "([A-Za-z]+)ae([A-Za-z]+)",
    "replacement": "$1a$2"
  }
}
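Even Peter de Langhe versus Peter delange can be handled with a pattern_replace char filter, which joins a standalone "de" to the word that follows it (turning Peter de Langhe into Peter deLanghe):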
"char_filter": {
  "de_char_filter": {
    "type": "pattern_replace",
    "pattern": "([A-Za-z]+) de ([A-Za-z]+)",
    "replacement": "$1 de$2"
  }
}
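Char filters only take effect once the analyzer references them by name. A minimal sketch of how name_analyzer might list both char filters, assuming they are defined under the same analysis section as the analyzer:

```json
{
  "analyzer": {
    "name_analyzer": {
      "type": "custom",
      "char_filter": ["ae_char_filter", "de_char_filter"],
      "tokenizer": "standard",
      "filter": ["lowercase", "trim", "my_ascii_folding"]
    }
  }
}
```

Char filters run on the raw text before the tokenizer and before the lowercase filter, so the pattern " de " as written only matches a lowercase "de". A capitalized "De" would need its own alternative in the pattern.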