I'm struggling to make iPhone match when searching for iphone in Elasticsearch.
Since I'm indexing source code, I definitely need camelCase tokenization, but it also breaks iPhone into two terms, so iphone can't be found.
Does anyone know of a way to add exceptions to the splitting of camelCase words into tokens (camel + case)?
UPDATE: to make it clear, I want NullPointerException to be tokenized as [null, pointer, exception], but I don't want iPhone to become [i, phone].
Any other solution?
UPDATE 2: @ChintanShah's answer suggests a different approach that gives us even more - NullPointerException will be tokenized as [null, pointer, exception, nullpointer, pointerexception, nullpointerexception], which is much more useful from the searcher's point of view. Indexing is also faster! The price to pay is index size, but it is a superior solution.
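For reference, my understanding of that approach is roughly the following: split on case changes as before, then use a shingle filter with an empty token_separator to glue adjacent parts back together (a sketch - all names here are mine, not necessarily those from the answer):

{
  "settings": {
    "analysis": {
      "analyzer": {
        "camel_shingle_analyzer": {
          "tokenizer": "whitespace",
          "filter": [
            "camel_split",
            "lowercase",
            "camel_shingles"
          ]
        }
      },
      "filter": {
        "camel_split": {
          "type": "word_delimiter",
          "split_on_numerics": false
        },
        "camel_shingles": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 3,
          "token_separator": "",
          "output_unigrams": true
        }
      }
    }
  }
}

A nice side effect: iPhone becomes [i, phone, iphone], so iphone matches without maintaining a protected_words list.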
You can achieve your requirements with the word_delimiter token filter. This is my setup:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "camel_analyzer": {
          "tokenizer": "whitespace",
          "filter": [
            "camel_filter",
            "lowercase",
            "asciifolding"
          ]
        }
      },
      "filter": {
        "camel_filter": {
          "type": "word_delimiter",
          "generate_number_parts": false,
          "stem_english_possessive": false,
          "split_on_numerics": false,
          "protected_words": [
            "iPhone",
            "WiFi"
          ]
        }
      }
    }
  },
  "mappings": {}
}
This will split words on case changes, so NullPointerException will be tokenized as null, pointer and exception, while iPhone and WiFi stay intact because they are protected. word_delimiter has a lot of options for flexibility. You can also set preserve_original, which will help you a lot.
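For example, to also keep the unsplit token, you only need to add one flag to the filter above (a sketch):

"camel_filter": {
  "type": "word_delimiter",
  "generate_number_parts": false,
  "stem_english_possessive": false,
  "split_on_numerics": false,
  "preserve_original": true,
  "protected_words": [
    "iPhone",
    "WiFi"
  ]
}

With preserve_original set, NullPointerException is indexed as [nullpointerexception, null, pointer, exception] - the original token plus its parts, all lowercased by the following lowercase filter.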
GET logs_index/_analyze?text=iPhone&analyzer=camel_analyzer
Result
{
  "tokens": [
    {
      "token": "iphone",
      "start_offset": 0,
      "end_offset": 6,
      "type": "word",
      "position": 1
    }
  ]
}
Now with
GET logs_index/_analyze?text=NullPointerException&analyzer=camel_analyzer
Result
{
  "tokens": [
    {
      "token": "null",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 1
    },
    {
      "token": "pointer",
      "start_offset": 4,
      "end_offset": 11,
      "type": "word",
      "position": 2
    },
    {
      "token": "exception",
      "start_offset": 11,
      "end_offset": 20,
      "type": "word",
      "position": 3
    }
  ]
}
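Note that the same analyzer is applied at query time (assuming the field is mapped with camel_analyzer and no separate search analyzer is configured), so a plain match query just works - the message field name here is hypothetical:

GET logs_index/_search
{
  "query": {
    "match": {
      "message": "iphone"
    }
  }
}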
Another approach is to analyze your field twice with different analyzers, but I feel word_delimiter will do the trick.
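If you do want to compare, a sketch of that multi-analyzer idea with multi-fields (field and type names are made up; string is the pre-5.x field type, matching the API style used above):

"mappings": {
  "log": {
    "properties": {
      "message": {
        "type": "string",
        "analyzer": "camel_analyzer",
        "fields": {
          "plain": {
            "type": "string",
            "analyzer": "standard"
          }
        }
      }
    }
  }
}

You would then search across both message and message.plain, e.g. with a multi_match query.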
Does this help?