There are similar questions here, such as Elasticsearch Map case insensitive to not_analyzed documents, but mine is slightly different because I deal with special characters. Most people recommend using a keyword analyzer combined with a lowercase filter. However, this does not seem to work for my case, because the keyword analyzer appears to tokenize on spaces and on special characters like ^, #, etc., which breaks the kind of matching I'm going for. For example:
- ^HELLOWORLD should be matched by searching for ^helloworld, but not helloworld.
- #FooBar should be matched by #foobar, but not foobar.
- Foo Bar should be matched by foo bar, but not foo or bar.

In other words, I want functionality similar to what is shown in https://www.elastic.co/guide/en/elasticsearch/guide/current/_finding_exact_values.html#_term_filter_with_numbers, but case-insensitive.
Does anyone know how to accomplish this?
Edit: It seems the core of my problem was with multi_field, as keyword + lowercase does resolve the question posed in the title. It would be more accurate to pose this question about a multi_field property; see the corrected sketch after the mapping below.
test_mapping.json:
{
    "properties": {
        "productID1": {
            "type": "string",
            "index_analyzer": "keyword_lowercase",
            "search_analyzer": "keyword_lowercase"
        },
        "productID2": {
            "type": "multi_field",
            "keyword_edge_ID": {
                "type": "string",
                "index_analyzer": "keyword_lowercase_edge",
                "search_analyzer": "keyword_lowercase_edge"
            },
            "productID2": {
                "type": "string",
                "index": "analyzed",
                "store": "yes",
                "index_analyzer": "keyword_lowercase",
                "search_analyzer": "keyword_lowercase"
            }
        }
    }
}
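I believe the issue is that the sub-fields above are not nested under a fields object, which the pre-1.0 multi_field syntax expects. A corrected sketch of the productID2 mapping, assuming that syntax (the sub-field name keyword_edge_ID is just carried over from above):

{
    "properties": {
        "productID2": {
            "type": "multi_field",
            "fields": {
                "productID2": {
                    "type": "string",
                    "index": "analyzed",
                    "store": "yes",
                    "index_analyzer": "keyword_lowercase",
                    "search_analyzer": "keyword_lowercase"
                },
                "keyword_edge_ID": {
                    "type": "string",
                    "index_analyzer": "keyword_lowercase_edge",
                    "search_analyzer": "keyword_lowercase_edge"
                }
            }
        }
    }
}

With this layout, productID2 keeps the keyword_lowercase behavior, and the edge-ngram variant is addressable as productID2.keyword_edge_ID.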
test.json:
{
    "index": {
        "analysis": {
            "filter": {
                "edgengramfilter": {
                    "type": "edgeNgram",
                    "side": "front",
                    "min_gram": 1,
                    "max_gram": 32
                }
            },
            "analyzer": {
                "keyword_lowercase": {
                    "type": "custom",
                    "tokenizer": "keyword",
                    "filter": "lowercase"
                },
                "keyword_lowercase_edge": {
                    "tokenizer": "keyword",
                    "filter": ["lowercase", "edgengramfilter"]
                }
            }
        }
    }
}
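To sanity-check the custom analyzers by name before indexing anything, the index-scoped _analyze endpoint can be used once the test index from the script below exists (the expected tokens here are my assumption of how the edge ngrams come out):

curl "localhost:9200/test/_analyze?pretty&analyzer=keyword_lowercase" -d "^HELLOWORLD"
curl "localhost:9200/test/_analyze?pretty&analyzer=keyword_lowercase_edge" -d "#FooBar"

The first call should return the single token ^helloworld; the second should return front edge ngrams of the whole lowercased value (#, #f, #fo, ... up to #foobar).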
Shell script to create index with mappings:
#!/bin/sh
ES_URL="http://localhost:9200"
curl -XDELETE "$ES_URL/test"
curl -XPOST "$ES_URL/test/" --data-binary @test.json
curl -XPOST "$ES_URL/test/query/_mapping" --data-binary @test_mapping.json
Then index a test document with POST localhost:9200/test/query:

{
    "productID1": "^A",
    "productID2": "^A"
}
I'd like to be able to match against productID2 with "^A", but the query below returns no results right now; the same query against productID1 works:

{ "query": { "match": { "productID2": "^A" } } }
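With the fields wrapper from my edit above, I'd expect both the whole-value match and the prefix-style match to work, roughly like this (the sub-field path is an assumption based on the mapping sketch):

curl -XPOST "localhost:9200/test/query/_search?pretty" -d '{
    "query": { "match": { "productID2": "^a" } }
}'

curl -XPOST "localhost:9200/test/query/_search?pretty" -d '{
    "query": { "match": { "productID2.keyword_edge_ID": "^a" } }
}'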
As you can see in the examples below, the keyword tokenizer and lowercase filter do exactly what you're asking for: they lowercase the entire value while preserving all spaces and special characters. An example of how to use this combination can be found in this answer.
curl "localhost:9200/_analyze?pretty&tokenizer=keyword&filters=lowercase" -d "^HELLOWORLD"
{
"tokens" : [ {
"token" : "^helloworld",
"start_offset" : 0,
"end_offset" : 11,
"type" : "word",
"position" : 1
} ]
}
curl "localhost:9200/_analyze?pretty&tokenizer=keyword&filters=lowercase" -d "#FooBar"
{
"tokens" : [ {
"token" : "#foobar",
"start_offset" : 0,
"end_offset" : 7,
"type" : "word",
"position" : 1
} ]
}
curl "localhost:9200/_analyze?pretty&tokenizer=keyword&filters=lowercase" -d "Foo Bar"
{
"tokens" : [ {
"token" : "foo bar",
"start_offset" : 0,
"end_offset" : 7,
"type" : "word",
"position" : 1
} ]
}
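To tie this back to the term-filter approach linked in the question, a value indexed through keyword_lowercase can then be looked up exactly but case-insensitively. A minimal sketch, assuming the test index and document from the question (note that a term query bypasses search-time analysis, so the input must already be lowercased):

curl -XPOST "localhost:9200/test/query/_search?pretty" -d '{
    "query": {
        "term": { "productID1": "^a" }
    }
}'

A match query on the same field would achieve the same result without manual lowercasing, since the field's search_analyzer is keyword_lowercase.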