I am running a fuzzy Elasticsearch query on the 'text' and 'keywords' fields. I have two documents in Elasticsearch, one with 'text' "testPhone 5" and the other with "testPhone 4s". When I run a fuzzy query for "testPhone 5", both documents get exactly the same score. Why does this happen?
Extra info: I am indexing documents using the 'uax_url_email' tokenizer and 'lowercase' filter.
This is the query I am making:
{
  query: {
    bool: {
      // match one or the other fuzzy query
      should: [
        {
          fuzzy: {
            text: {
              min_similarity: 0.4,
              value: 'testphone 5',
              prefix_length: 0,
              boost: 5,
            }
          }
        },
        {
          fuzzy: {
            keywords: {
              min_similarity: 0.4,
              value: 'testphone 5',
              prefix_length: 0,
              boost: 1,
            }
          }
        }
      ]
    }
  },
  sort: [
    '_score'
  ],
  explain: true
}
This is the result:
{ max_score: 0.47213298,
total: 2,
hits:
[ { _index: 'test',
_shard: 0,
_id: '51fbf95f82e89ae8c300002c',
_node: '0Mtfzbe1RDinU71Ordx-Ag',
_source:
{ next: { id: '51fbf95f82e89ae8c3000027' },
cards: [ '51fbf95f82e89ae8c3000027', [length]: 1 ],
other: false,
_id: '51fbf95f82e89ae8c300002c',
category: '51fbf95f82e89ae8c300002b',
image: 'https://s3.amazonaws.com/sold_category_icons/Smartphones.png',
text: 'testPhone 5',
keywords: [ [length]: 0 ],
__v: 0 },
_type: 'productgroup',
_explanation:
{ details:
[ { details:
[ { details:
[ { details:
[ { details:
[ { value: 3.8888888, description: 'boost' },
{ value: 1.5108256,
description: 'idf(docFreq=2, maxDocs=5)' },
{ value: 0.17020021,
description: 'queryNorm' },
[length]: 3 ],
value: 0.99999994,
description: 'queryWeight, product of:' },
{ details:
[ { details:
[ { value: 1, description: 'termFreq=1.0' },
[length]: 1 ],
value: 1,
description: 'tf(freq=1.0), with freq of:' },
{ value: 1.5108256,
description: 'idf(docFreq=2, maxDocs=5)' },
{ value: 0.625,
description: 'fieldNorm(doc=0)' },
[length]: 3 ],
value: 0.944266,
description: 'fieldWeight in 0, product of:' },
[length]: 2 ],
value: 0.94426596,
description: 'score(doc=0,freq=1.0 = termFreq=1.0\n), product of:' },
[length]: 1 ],
value: 0.94426596,
description: 'weight(text:testphone^3.8888888 in 0) [PerFieldSimilarity], result of:' },
[length]: 1 ],
value: 0.94426596,
description: 'sum of:' },
{ value: 0.5, description: 'coord(1/2)' },
[length]: 2 ],
value: 0.47213298,
description: 'product of:' },
_score: 0.47213298 },
{ _index: 'test',
_shard: 4,
_id: '51fbf95f82e89ae8c300002d',
_node: '0Mtfzbe1RDinU71Ordx-Ag',
_source:
{ next: { id: '51fbf95f82e89ae8c3000027' },
cards: [ '51fbf95f82e89ae8c3000029', [length]: 1 ],
other: false,
_id: '51fbf95f82e89ae8c300002d',
category: '51fbf95f82e89ae8c300002b',
image: 'https://s3.amazonaws.com/sold_category_icons/Smartphones.png',
text: 'testPhone 4s',
keywords: [ 'apple', [length]: 1 ],
__v: 0 },
_type: 'productgroup',
_explanation:
{ details:
[ { details:
[ { details:
[ { details:
[ { details:
[ { value: 3.8888888, description: 'boost' },
{ value: 1.5108256,
description: 'idf(docFreq=2, maxDocs=5)' },
{ value: 0.17020021,
description: 'queryNorm' },
[length]: 3 ],
value: 0.99999994,
description: 'queryWeight, product of:' },
{ details:
[ { details:
[ { value: 1, description: 'termFreq=1.0' },
[length]: 1 ],
value: 1,
description: 'tf(freq=1.0), with freq of:' },
{ value: 1.5108256,
description: 'idf(docFreq=2, maxDocs=5)' },
{ value: 0.625,
description: 'fieldNorm(doc=0)' },
[length]: 3 ],
value: 0.944266,
description: 'fieldWeight in 0, product of:' },
[length]: 2 ],
value: 0.94426596,
description: 'score(doc=0,freq=1.0 = termFreq=1.0\n), product of:' },
[length]: 1 ],
value: 0.94426596,
description: 'weight(text:testphone^3.8888888 in 0) [PerFieldSimilarity], result of:' },
[length]: 1 ],
value: 0.94426596,
description: 'sum of:' },
{ value: 0.5, description: 'coord(1/2)' },
[length]: 2 ],
value: 0.47213298,
description: 'product of:' },
_score: 0.47213298 },
[length]: 2 ] }
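Fuzzy queries are not analyzed, but the field is. Your search for testphone 5 with a min_similarity of 0.4 is therefore compared, as a single term, against the analyzed terms in the index. It is close enough to the analyzed term testphone, which both documents contain, so both documents match on that same term and receive the same score. You can see this in the explain output:
description: 'weight(text:testphone^3.8888888 in 0) [PerFieldSimilarity], result of:' },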
See also @imotov's excellent answer here: ElasticSearch's Fuzzy Query
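To make the point above concrete, here is a rough sketch in plain JavaScript. It is not how Elasticsearch is implemented: the tokenizer is a whitespace-and-lowercase stand-in for the uax_url_email tokenizer plus lowercase filter, and the similarity formula (1 minus edit distance divided by the length of the shorter string) is my assumption about how min_similarity is applied. All function names are made up for illustration.

```javascript
// Stand-in for the uax_url_email tokenizer + lowercase filter (assumption:
// for these simple values, splitting on whitespace gives the same terms).
function tokenize(text) {
  return text.toLowerCase().split(/\s+/).filter(Boolean);
}

// Classic Levenshtein edit distance (insert, delete, substitute).
function editDistance(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Assumed similarity: 1 - distance / length of the shorter string.
function similarity(a, b) {
  return 1 - editDistance(a, b) / Math.min(a.length, b.length);
}

// The fuzzy value is NOT analyzed, so it stays one term, space included,
// and is compared against each indexed term separately.
function matchingTerms(fuzzyValue, docText, minSimilarity) {
  return tokenize(docText).filter(
    (term) => similarity(fuzzyValue, term) >= minSimilarity
  );
}

console.log(matchingTerms('testphone 5', 'testPhone 5', 0.4));  // [ 'testphone' ]
console.log(matchingTerms('testphone 5', 'testPhone 4s', 0.4)); // [ 'testphone' ]
```

Both documents match only through the shared term testphone, so nothing distinguishes them at scoring time. Under the assumed formula the similarity for testphone is 1 - 2/9, roughly 0.78, and 5 times that is roughly 3.89, which lines up with the ^3.8888888 boost in the explain output. That reading is an inference on my part, not something the output states.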
You can see exactly how a string will be tokenized using the _analyze API:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-analyze.html
For example,
http://localhost:9200/prefix_test/_analyze?field=text&text=testphone+5
will return:
{
  "tokens": [
    {
      "token": "testphone",
      "start_offset": 0,
      "end_offset": 9,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "5",
      "start_offset": 10,
      "end_offset": 11,
      "type": "<NUM>",
      "position": 2
    }
  ]
}
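If you want to script this check, something along these lines could work. It is only a sketch: analyzeUrl and tokensOf are hypothetical helper names, the host and index are the ones from the example URL above, and the sample response is a trimmed copy of the output shown above rather than a live call.

```javascript
// Build the _analyze URL for a field and a sample text.
// URLSearchParams encodes the space in the text as '+'.
function analyzeUrl(field, text, host = 'http://localhost:9200', index = 'prefix_test') {
  const params = new URLSearchParams({ field, text });
  return `${host}/${index}/_analyze?${params}`;
}

// Pull just the token strings out of an _analyze response body.
function tokensOf(response) {
  return response.tokens.map((t) => t.token);
}

// Trimmed copy of the response shown above, used instead of a network call.
const sampleResponse = {
  tokens: [
    { token: 'testphone', start_offset: 0, end_offset: 9, type: '<ALPHANUM>', position: 1 },
    { token: '5', start_offset: 10, end_offset: 11, type: '<NUM>', position: 2 }
  ]
};

console.log(analyzeUrl('text', 'testphone 5'));
// http://localhost:9200/prefix_test/_analyze?field=text&text=testphone+5
console.log(tokensOf(sampleResponse)); // [ 'testphone', '5' ]
```

The resulting URL can be requested with curl or any HTTP client. The token list it returns is what the unanalyzed fuzzy value gets compared against.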
So even if you index the value testphone sammsung, a fuzzy query for "testphone samsunk" won't yield anything, whereas just samsunk will.
You may get better results by not analyzing the field (or by using the keyword analyzer).
If you want different analysis on a single field, you can use the multi_field construct:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-multi-field-type.html
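A mapping along these lines might be a starting point. Treat it as a sketch: the sub-field name untouched is my own choice, the analyzer settings for the analyzed sub-field are left out, and I am assuming the productgroup type from the question. Note that a not_analyzed field keeps the original casing, so the fuzzy value here is written as testPhone 5.

```javascript
// Hypothetical mapping: 'text' stays analyzed as before, while
// 'text.untouched' stores the whole value as one unanalyzed term.
const mapping = {
  productgroup: {
    properties: {
      text: {
        type: 'multi_field',
        fields: {
          text: { type: 'string' },                            // analyzed (analyzer settings omitted)
          untouched: { type: 'string', index: 'not_analyzed' } // whole value as a single term
        }
      }
    }
  }
};

// Fuzzy query aimed at the unanalyzed sub-field, so the full value
// "testPhone 5" is compared against "testPhone 5" and "testPhone 4s".
const query = {
  query: {
    fuzzy: {
      'text.untouched': { value: 'testPhone 5', min_similarity: 0.4, prefix_length: 0 }
    }
  }
};

console.log(JSON.stringify({ mapping, query }, null, 2));
```

With the whole value indexed as one term, the exact document and the near miss are different terms at different edit distances, which gives the fuzzy scoring something to tell them apart by.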
I ran into this issue myself recently. I can't tell you exactly why it happens, but I CAN tell you how I fixed it:
I ran two queries over the same field: one with an exact match, and then the same query on the same field with fuzzy matching enabled and a lower boost.
That made sure my exact matches always ranked higher than the fuzzy matches.
P.S. I think they're scored equally because, with fuzziness enabled, they both match, and ES doesn't care that one is an exact match as long as both match. This is pure theorycrafting on my end, though, since I'm not intimately familiar with the scoring algorithm.
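For reference, the two-query approach could look roughly like this. The helper name exactThenFuzzy, the use of a match query for the exact part, and the boost values 5 and 1 are my assumptions for illustration; the min_similarity and prefix_length values are taken from the question.

```javascript
// Build a bool query with two should clauses on the same field:
// a non-fuzzy match with a high boost and a fuzzy clause with a low boost.
function exactThenFuzzy(field, value) {
  return {
    query: {
      bool: {
        should: [
          // Exact (non-fuzzy) clause: boosted so real matches rank first.
          { match: { [field]: { query: value, boost: 5 } } },
          // Fuzzy clause: still catches near misses, but contributes less.
          { fuzzy: { [field]: { value: value, min_similarity: 0.4, prefix_length: 0, boost: 1 } } }
        ]
      }
    },
    sort: ['_score']
  };
}

console.log(JSON.stringify(exactThenFuzzy('text', 'testphone 5'), null, 2));
```

A document that satisfies both clauses collects both contributions, so it should end up above documents that only satisfy the fuzzy clause.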