I have this index:
{
"settings":{
"index":{
"number_of_replicas":0,
"analysis":{
"analyzer":{
"default":{
"type":"custom",
"tokenizer":"keyword",
"filter":[
"lowercase",
"my_ngram"
]
}
},
"filter":{
"my_ngram":{
"type":"nGram",
"min_gram":2,
"max_gram":20
}
}
}
}
}
}
and I'm performing this search through the tire gem:
{
"query":{
"query_string":{
"query":"xyz",
"default_operator":"AND"
}
},
"sort":[
{
"count":"desc"
}
],
"filter":{
"term":{
"active":true,
"_type":null
}
},
"highlight":{
"fields":{
"name":{
}
},
"pre_tags":[
"<strong>"
],
"post_tags":[
"</strong>"
]
}
}
I have two posts that should match, named 'xyz post' and 'xyz question'. When I perform this search, I get the highlighted fields back properly:
<strong>xyz</strong> question
<strong>xyz</strong> post
Now here's the thing: as soon as I change min_gram to 1 in my index and reindex, the highlighted fields start coming back like this:
<strong>x</strong><strong>y</strong><strong>z</strong> pos<strong>xyz</strong>t
<strong>x</strong><strong>y</strong><strong>z</strong> questio<strong>xyz</strong>n
I simply cannot understand why.
You need to check your mapping and see whether you are using the fast-vector-highlighter. Even then, you need to be quite careful about your queries.
Assume a fresh instance of ES 0.20.4 running on localhost.
Building on top of your example, let's add explicit mappings. Note that I set up two variants of the code field, both using the same analyzer. The only difference is that code.ngram has "term_vector":"with_positions_offsets".
curl -X PUT localhost:9200/myindex -d '
{
"settings" : {
"index":{
"number_of_replicas":0,
"number_of_shards":1,
"analysis":{
"analyzer":{
"default":{
"type":"custom",
"tokenizer":"keyword",
"filter":[
"lowercase",
"my_ngram"
]
}
},
"filter":{
"my_ngram":{
"type":"nGram",
"min_gram":1,
"max_gram":20
}
}
}
}
},
"mappings" : {
"product" : {
"properties" : {
"code" : {
"type" : "multi_field",
"fields" : {
"code" : {
"type" : "string",
"analyzer" : "default",
"store" : "yes"
},
"code.ngram" : {
"type" : "string",
"analyzer" : "default",
"store" : "yes",
"term_vector":"with_positions_offsets"
}
}
}
}
}
}
}'
Index some data.
curl -X POST 'localhost:9200/myindex/product' -d '{
"code" : "Samsung Galaxy i7500"
}'
curl -X POST 'localhost:9200/myindex/product' -d '{
"code" : "Samsung Galaxy 5 Europa"
}'
curl -X POST 'localhost:9200/myindex/product' -d '{
"code" : "Samsung Galaxy Mini"
}'
And now we can run queries.
curl -X GET 'localhost:9200/myindex/product/_search?pretty' -d '{
"fields" : [ "code" ],
"query" : {
"term" : {
"code" : "i"
}
},
"highlight" : {
"number_of_fragments" : 0,
"fields" : {
"code":{},
"code.ngram":{}
}
}
}'
This yields two search hits:
# 1
...
"fields" : {
"code" : "Samsung Galaxy Mini"
},
"highlight" : {
"code.ngram" : [ "Samsung Galaxy M<em>i</em>n<em>i</em>" ],
"code" : [ "Samsung Galaxy M<em>i</em>n<em>i</em>" ]
}
# 2
...
"fields" : {
"code" : "Samsung Galaxy i7500"
},
"highlight" : {
"code.ngram" : [ "Samsung Galaxy <em>i</em>7500" ],
"code" : [ "Samsung Galaxy <em>i</em>7500" ]
}
Both the code and code.ngram fields were correctly highlighted this time. But things change quickly when a longer query is used:
curl -X GET 'localhost:9200/myindex/product/_search?pretty' -d '{
"fields" : [ "code" ],
"query" : {
"term" : {
"code" : "y m"
}
},
"highlight" : {
"number_of_fragments" : 0,
"fields" : {
"code":{},
"code.ngram":{}
}
}
}'
This yields:
"fields" : {
"code" : "Samsung Galaxy Mini"
},
"highlight" : {
"code.ngram" : [ "Samsung Galax<em>y M</em>ini" ],
"code" : [ "Samsung Galaxy Min<em>y M</em>i" ]
}
The code field is not highlighted correctly (similar to your case).
One important thing to note is that a term query is used here instead of query_string.
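The query type matters because a term query is not analyzed, while query_string text is normally run through an analyzer first, which in this index is presumably the n-gram default analyzer. As a rough illustration only (this is plain Python, not Elasticsearch code; ngrams is a hypothetical helper that imitates keyword tokenizer + lowercase + nGram filter, and its output order is not meant to match Lucene's), here is what the query text "xyz" would be split into for the two min_gram settings:

```python
def ngrams(text, min_gram, max_gram):
    """Return all lowercase character n-grams of the whole input string.

    Hypothetical helper: imitates a keyword tokenizer (input kept as one
    token) followed by lowercase and nGram filters. Order is by gram size,
    then by start position, which is an assumption of this sketch.
    """
    token = text.lower()  # keyword tokenizer: the whole input is one token
    grams = []
    # Gram sizes cannot exceed the token length, whatever max_gram says.
    for size in range(min_gram, min(max_gram, len(token)) + 1):
        for start in range(len(token) - size + 1):
            grams.append(token[start:start + size])
    return grams


if __name__ == "__main__":
    # min_gram = 2, as in the original index settings
    print(ngrams("xyz", 2, 20))
    # min_gram = 1, as in the reindexed settings
    print(ngrams("xyz", 1, 20))
```

With min_gram set to 1 the single characters x, y and z become terms of their own, which is at least consistent with each letter being wrapped in its own highlight tag in the question's output. It does not by itself explain the misplaced tag at the end of the fragment.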