As far as I read in the ES documentation:
( source: http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/custom-analyzers.html )
From those two statements, I understand that the following steps are executed, in this order: the char_filter runs first, then the tokenizer, then the (token) filter.
Problem:
I may have a char_filter that replaces several letters at once.
Example: ph -> f.
However, "PH" won't be turned into "f", because "PH" is not part of the mapping and the lowercase filter only runs after the char_filter.
So the analysis of "philipp" yields "filipp", whereas "Philipp" yields "philipp".
To get the same result for both upper and lowercase input, the char_filter needs one mapping per case variant of each source string, i.e. 2^n mappings for a source string of n letters.
Example: ph -> f; Ph -> F; pH -> f; PH -> F.
It wouldn't be a problem if I only had these 4 mappings, but as soon as I need more source strings, each with its own 2^n variants, the char_filter becomes a big mess.
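To make the growth concrete, here is a small Python sketch that enumerates every upper/lowercase spelling of a source string. The helper name `case_variants` is made up for this illustration and has nothing to do with Elasticsearch itself; it also assumes the source string consists of letters only.

```python
from itertools import product

def case_variants(source):
    """Return every upper/lowercase spelling of `source` (letters only)."""
    # For each character, offer its lowercase and its uppercase form.
    choices = [(ch.lower(), ch.upper()) for ch in source]
    # The Cartesian product picks one form per character: 2^n combinations.
    return ["".join(combo) for combo in product(*choices)]

print(case_variants("ph"))        # ['ph', 'pH', 'Ph', 'PH']
print(len(case_variants("sch")))  # 8
```

A two-letter source needs 4 mappings and a three-letter source already needs 8, which is why maintaining the list by hand gets unwieldy.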
Example of index:
{
  "settings" : {
    "index" : {
      "analysis" : {
        "analyzer" : {
          "default_index" : {
            "type" : "custom",
            "tokenizer" : "whitespace",
            "filter" : [
              "lowercase"
            ],
            "char_filter" : [
              "misc_simplifications"
            ]
          }
        },
        "char_filter" : {
          "misc_simplifications" : {
            "type" : "mapping",
            "mappings" : [
              "ph=>f","Ph=>F","pH=>f","PH=>F"
            ]
          }
        }
      }
    }
  }
}
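The behaviour of this analyzer can be approximated outside Elasticsearch with a few lines of Python. This is only a rough simulation under simplifying assumptions (the function names are invented here, `str.split` stands in for the whitespace tokenizer, and `str.lower` stands in for the lowercase filter); it is meant to show the order of the stages, not how Elasticsearch implements them.

```python
# Rough simulation of the analyzer above; this is not Elasticsearch code.
MAPPINGS = {"ph": "f", "Ph": "F", "pH": "f", "PH": "F"}

def mapping_char_filter(text, mappings):
    """Replace mapped keys left to right, preferring the longest key."""
    keys = sorted(mappings, key=len, reverse=True)
    out, i = [], 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(mappings[key])
                i += len(key)
                break
        else:
            # No key matches at this position: copy the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)

def analyze(text, mappings):
    filtered = mapping_char_filter(text, mappings)  # 1. char_filter
    tokens = filtered.split()                       # 2. whitespace tokenizer
    return [t.lower() for t in tokens]              # 3. lowercase token filter

print(analyze("Philipp PHILIPP", MAPPINGS))  # ['filipp', 'filipp']
print(analyze("Philipp", {"ph": "f"}))       # ['philipp']
```

With all four case variants in the mapping, both spellings end up as "filipp". With only `ph=>f`, the capitalised input passes through the char_filter untouched and is lowercased too late to be matched.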
Philosophical question:
I understand that I may want to treat "ph" and "Ph" equally, but "pH" could mean something totally different. But is there a way of turning the characters into lowercase before the char_filter phase? Does it make sense?
That big mapping gives me the feeling that I am doing something wrong, or that there is an easier (more elegant) solution.
You're correct about the sequence of steps: char filters run first, then the tokenizer, then the token filters.
However, the main purpose of a CharFilter is to clean up the data to make the tokenisation easier, for example by stripping out XML tags or replacing a delimiter with a space character.
So I would define misc_simplifications as a TokenFilter, to be applied after the lowercase filter:
{
  "settings" : {
    "index" : {
      "analysis" : {
        "analyzer" : {
          "default_index" : {
            "type" : "custom",
            "tokenizer" : "whitespace",
            "filter" : [
              "lowercase",
              "misc_simplifications"
            ]
          }
        },
        "filter" : {
          "misc_simplifications" : {
            "type" : "pattern_replace",
            "pattern": "ph",
            "replacement":"f"
          }
        }
      }
    }
  }
}
Note that I've used a pattern_replace token filter instead of the mapping char filter. You could also modify the regexp so that it only replaces "ph" at the beginning of a token (for example, with the pattern ^ph).
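As a sanity check of the idea, the reordered chain can be approximated in Python. This is only a sketch under assumptions: the function name `analyze` is invented for the illustration, `str.split` and `str.lower` stand in for the whitespace tokenizer and the lowercase filter, and Python's `re` module stands in for the Java regular expressions Elasticsearch uses (the two agree for patterns as simple as these).

```python
import re

def analyze(text, pattern="ph", replacement="f"):
    """Approximate: whitespace tokenizer -> lowercase -> pattern_replace."""
    tokens = text.split()                 # whitespace tokenizer
    tokens = [t.lower() for t in tokens]  # lowercase token filter
    # pattern_replace token filter: replace every match inside each token
    return [re.sub(pattern, replacement, t) for t in tokens]

# A single rule now covers every spelling, because lowercasing happens first.
print(analyze("philipp Philipp PHILIPP"))          # ['filipp', 'filipp', 'filipp']

# Anchoring the pattern limits the replacement to the start of a token.
print(analyze("Philipp alphabet", pattern="^ph"))  # ['filipp', 'alphabet']
```

Because the replacement sees already-lowercased tokens, one rule per source string is enough, instead of one rule per case variant.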
Also, your mappings look like phonetic replacements. I'm not sure of your requirements, but the phonetic token filter might help you.