I am trying to enable full text search across the tags (keyword phrases) I have created that can be assigned to documents in my index (named "Delta").
My results are (1) not what I would expect and (2) not consistent if I re-run the same code repeatedly.
Below is some code. I have simplified the mappings and documents to make the code clearer and to make sure the problem wasn't in some other part of the documents or mappings. I am running all of this using the Kibana Dev Tools Console.
PUT /mdelta
{
  "mappings": {
    "tags": {
      "properties": {
        "synonyms": {
          "type": "text"
        }
      }
    }
  }
}
POST _bulk
{ "index" : { "_index" : "mdelta", "_type" : "tags" }}
{"synonyms":"Iron"}
{ "index" : { "_index" : "mdelta", "_type" : "tags" }}
{"synonyms":"Fe"}
{ "index" : { "_index" : "mdelta", "_type" : "tags" }}
{"synonyms":"Iron Deficiency"}
{ "index" : { "_index" : "mdelta", "_type" : "tags" }}
{"synonyms":"Serum Iron"}
{ "index" : { "_index" : "mdelta", "_type" : "tags" }}
{"synonyms":"Iron Sulfate"}
{ "index" : { "_index" : "mdelta", "_type" : "tags" }}
{"synonyms":"Iron Deficiency Anemia"}
GET mdelta/tags/_search
{
  "explain": false,
  "query": {
    "match": {
      "synonyms": "iron"
    }
  }
}
Based on my understanding of the scoring algorithm, I would expect the document {"synonyms":"Iron"} to be returned first (top score). This is not the case. Results ...
{
"took": 0,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 5,
"max_score": 0.5377023,
"hits": [
{
"_index": "mdelta",
"_type": "tags",
"_id": "AWA8jRR9YXA6OBvYOfj9",
"_score": 0.5377023,
"_source": {
"synonyms": "Iron Sulfate"
}
},
{
"_index": "mdelta",
"_type": "tags",
"_id": "AWA8jRR9YXA6OBvYOfj5",
"_score": 0.2876821,
"_source": {
"synonyms": "Iron"
}
},
{
"_index": "mdelta",
"_type": "tags",
"_id": "AWA8jRR9YXA6OBvYOfj8",
"_score": 0.25811607,
"_source": {
"synonyms": "Serum Iron"
}
},
{
"_index": "mdelta",
"_type": "tags",
"_id": "AWA8jRR9YXA6OBvYOfj7",
"_score": 0.1805489,
"_source": {
"synonyms": "Iron Deficiency"
}
},
{
"_index": "mdelta",
"_type": "tags",
"_id": "AWA8jRR9YXA6OBvYOfj-",
"_score": 0.14638957,
"_source": {
"synonyms": "Iron Deficiency Anemia"
}
}
]
}
}
I repeated the query with explain set to true.
{
"took": 38,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 5,
"max_score": 0.5377023,
"hits": [
{
"_shard": "[mdelta][4]",
"_node": "McQ619KqR0akS1mHvTXjDw",
"_index": "mdelta",
"_type": "tags",
"_id": "AWA8jRR9YXA6OBvYOfj9",
"_score": 0.5377023,
"_source": {
"synonyms": "Iron Sulfate"
},
"_explanation": {
"value": 0.5377023,
"description": "weight(synonyms:iron in 1) [PerFieldSimilarity], result of:",
"details": [
{
"value": 0.5377023,
"description": "score(doc=1,freq=1.0 = termFreq=1.0\n), product of:",
"details": [
{
"value": 0.6931472,
"description": "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
"details": [
{
"value": 1,
"description": "docFreq",
"details": []
},
{
"value": 2,
"description": "docCount",
"details": []
}
]
},
{
"value": 0.7757405,
"description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
"details": [
{
"value": 1,
"description": "termFreq=1.0",
"details": []
},
{
"value": 1.2,
"description": "parameter k1",
"details": []
},
{
"value": 0.75,
"description": "parameter b",
"details": []
},
{
"value": 1.5,
"description": "avgFieldLength",
"details": []
},
{
"value": 2.56,
"description": "fieldLength",
"details": []
}
]
}
]
}
]
}
},
{
"_shard": "[mdelta][2]",
"_node": "McQ619KqR0akS1mHvTXjDw",
"_index": "mdelta",
"_type": "tags",
"_id": "AWA8jRR9YXA6OBvYOfj5",
"_score": 0.2876821,
"_source": {
"synonyms": "Iron"
},
"_explanation": {
"value": 0.2876821,
"description": "weight(synonyms:iron in 0) [PerFieldSimilarity], result of:",
"details": [
{
"value": 0.2876821,
"description": "score(doc=0,freq=1.0 = termFreq=1.0\n), product of:",
"details": [
{
"value": 0.2876821,
"description": "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
"details": [
{
"value": 1,
"description": "docFreq",
"details": []
},
{
"value": 1,
"description": "docCount",
"details": []
}
]
},
{
"value": 1,
"description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
"details": [
{
"value": 1,
"description": "termFreq=1.0",
"details": []
},
{
"value": 1.2,
"description": "parameter k1",
"details": []
},
{
"value": 0.75,
"description": "parameter b",
"details": []
},
{
"value": 1,
"description": "avgFieldLength",
"details": []
},
{
"value": 1,
"description": "fieldLength",
"details": []
}
]
}
]
}
]
}
},
{
"_shard": "[mdelta][3]",
"_node": "McQ619KqR0akS1mHvTXjDw",
"_index": "mdelta",
"_type": "tags",
"_id": "AWA8jRR9YXA6OBvYOfj8",
"_score": 0.25811607,
"_source": {
"synonyms": "Serum Iron"
},
"_explanation": {
"value": 0.25811607,
"description": "weight(synonyms:iron in 0) [PerFieldSimilarity], result of:",
"details": [
{
"value": 0.25811607,
"description": "score(doc=0,freq=1.0 = termFreq=1.0\n), product of:",
"details": [
{
"value": 0.2876821,
"description": "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
"details": [
{
"value": 1,
"description": "docFreq",
"details": []
},
{
"value": 1,
"description": "docCount",
"details": []
}
]
},
{
"value": 0.89722675,
"description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
"details": [
{
"value": 1,
"description": "termFreq=1.0",
"details": []
},
{
"value": 1.2,
"description": "parameter k1",
"details": []
},
{
"value": 0.75,
"description": "parameter b",
"details": []
},
{
"value": 2,
"description": "avgFieldLength",
"details": []
},
{
"value": 2.56,
"description": "fieldLength",
"details": []
}
]
}
]
}
]
}
},
{
"_shard": "[mdelta][1]",
"_node": "McQ619KqR0akS1mHvTXjDw",
"_index": "mdelta",
"_type": "tags",
"_id": "AWA8jRR9YXA6OBvYOfj7",
"_score": 0.1805489,
"_source": {
"synonyms": "Iron Deficiency"
},
"_explanation": {
"value": 0.1805489,
"description": "weight(synonyms:iron in 0) [PerFieldSimilarity], result of:",
"details": [
{
"value": 0.1805489,
"description": "score(doc=0,freq=1.0 = termFreq=1.0\n), product of:",
"details": [
{
"value": 0.18232156,
"description": "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
"details": [
{
"value": 2,
"description": "docFreq",
"details": []
},
{
"value": 2,
"description": "docCount",
"details": []
}
]
},
{
"value": 0.9902773,
"description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
"details": [
{
"value": 1,
"description": "termFreq=1.0",
"details": []
},
{
"value": 1.2,
"description": "parameter k1",
"details": []
},
{
"value": 0.75,
"description": "parameter b",
"details": []
},
{
"value": 2.5,
"description": "avgFieldLength",
"details": []
},
{
"value": 2.56,
"description": "fieldLength",
"details": []
}
]
}
]
}
]
}
},
{
"_shard": "[mdelta][1]",
"_node": "McQ619KqR0akS1mHvTXjDw",
"_index": "mdelta",
"_type": "tags",
"_id": "AWA8jRR9YXA6OBvYOfj-",
"_score": 0.14638957,
"_source": {
"synonyms": "Iron Deficiency Anemia"
},
"_explanation": {
"value": 0.14638956,
"description": "weight(synonyms:iron in 1) [PerFieldSimilarity], result of:",
"details": [
{
"value": 0.14638956,
"description": "score(doc=1,freq=1.0 = termFreq=1.0\n), product of:",
"details": [
{
"value": 0.18232156,
"description": "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
"details": [
{
"value": 2,
"description": "docFreq",
"details": []
},
{
"value": 2,
"description": "docCount",
"details": []
}
]
},
{
"value": 0.8029196,
"description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
"details": [
{
"value": 1,
"description": "termFreq=1.0",
"details": []
},
{
"value": 1.2,
"description": "parameter k1",
"details": []
},
{
"value": 0.75,
"description": "parameter b",
"details": []
},
{
"value": 2.5,
"description": "avgFieldLength",
"details": []
},
{
"value": 4,
"description": "fieldLength",
"details": []
}
]
}
]
}
]
}
}
]
}
}
If you look at the first hit ("Iron Sulfate"), it appears that the docFreq is 1 and the docCount is 2. This is incorrect.
In addition, if I run DELETE /mdelta and then re-run my code, I can get the results in a different order, for example ...
{
"took": 4,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 5,
"max_score": 0.2876821,
"hits": [
{
"_index": "mdelta",
"_type": "tags",
"_id": "Qd0JQWABt4cFDxBHv7Fe",
"_score": 0.2876821,
"_source": {
"synonyms": "Serum Iron"
}
},
{
"_index": "mdelta",
"_type": "tags",
"_id": "Pt0JQWABt4cFDxBHv7Fe",
"_score": 0.2876821,
"_source": {
"synonyms": "Iron"
}
},
{
"_index": "mdelta",
"_type": "tags",
"_id": "QN0JQWABt4cFDxBHv7Fe",
"_score": 0.2876821,
"_source": {
"synonyms": "Iron Deficiency"
}
},
{
"_index": "mdelta",
"_type": "tags",
"_id": "Qt0JQWABt4cFDxBHv7Fe",
"_score": 0.19856805,
"_source": {
"synonyms": "Iron Sulfate"
}
},
{
"_index": "mdelta",
"_type": "tags",
"_id": "Q90JQWABt4cFDxBHv7Fe",
"_score": 0.16853254,
"_source": {
"synonyms": "Iron Deficiency Anemia"
}
}
]
}
}
Any ideas about what I am doing wrong would be greatly appreciated.
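The reason you don't get consistent results after reindexing is that the term statistics behind the score (the docFreq and docCount values in your explain output) are calculated per shard, not per index. Since you don't specify document IDs or routing, every run generates new IDs, so the documents are distributed across the five shards differently than in the previous index.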
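To see the per-shard effect in numbers, here is a minimal Python sketch that re-applies the idf and tfNorm formulas quoted in your explain output. The function names are my own, and the inputs are simply copied from the explanations of the "Iron Sulfate" and "Iron" hits; this is not Elasticsearch code.

```python
import math

def bm25_idf(doc_freq, doc_count):
    # idf formula as printed in the explain output
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

def bm25_tf_norm(freq, field_length, avg_field_length, k1=1.2, b=0.75):
    # tfNorm formula as printed in the explain output
    return (freq * (k1 + 1)) / (
        freq + k1 * (1 - b + b * field_length / avg_field_length)
    )

def bm25_score(doc_freq, doc_count, freq, field_length, avg_field_length):
    return bm25_idf(doc_freq, doc_count) * bm25_tf_norm(
        freq, field_length, avg_field_length
    )

# "Iron Sulfate" on shard 4: that shard holds 2 docs, 1 of which contains "iron"
iron_sulfate = bm25_score(doc_freq=1, doc_count=2, freq=1.0,
                          field_length=2.56, avg_field_length=1.5)

# "Iron" on shard 2: that shard holds a single doc
iron = bm25_score(doc_freq=1, doc_count=1, freq=1.0,
                  field_length=1.0, avg_field_length=1.0)

print(f"Iron Sulfate: {iron_sulfate:.7f}")
print(f"Iron:         {iron:.7f}")
```

The only reason "Iron Sulfate" wins is that its shard happens to contain a second document without the term, which raises the idf on that shard. The field length actually works against it.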
As for not getting what you expect from Elasticsearch: that is most likely because of the small number of documents in your index. Try running the query with the search_type parameter, like so: GET mdelta/tags/_search?search_type=dfs_query_then_fetch
This ensures that index-level frequencies are calculated first and then used for scoring on every shard.
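As a rough illustration of why index-level frequencies fix the ordering, the following Python sketch scores your six tags with one set of statistics for the whole collection. It is a simplification under stated assumptions: the standard analyzer is approximated by lowercasing and splitting on whitespace, and exact token counts are used for the field length, whereas your explain output shows an encoded length (2.56 for two tokens). The absolute scores will therefore not match Elasticsearch; only the ordering is the point.

```python
import math

K1, B = 1.2, 0.75  # BM25 parameters shown in the explain output

docs = ["Iron", "Fe", "Iron Deficiency", "Serum Iron",
        "Iron Sulfate", "Iron Deficiency Anemia"]

def tokenize(text):
    # Assumption: stand-in for the standard analyzer on these simple tags
    return text.lower().split()

def bm25(freq, field_length, avg_field_length, doc_freq, doc_count):
    idf = math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
    tf_norm = (freq * (K1 + 1)) / (
        freq + K1 * (1 - B + B * field_length / avg_field_length)
    )
    return idf * tf_norm

def rank_with_global_stats(documents, term):
    tokenized = [tokenize(d) for d in documents]
    doc_count = len(tokenized)                           # all docs in the index
    doc_freq = sum(1 for t in tokenized if term in t)    # docs containing the term
    avg_len = sum(len(t) for t in tokenized) / doc_count
    scored = [
        (doc, bm25(tokens.count(term), len(tokens), avg_len, doc_freq, doc_count))
        for doc, tokens in zip(documents, tokenized)
        if term in tokens
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

ranking = rank_with_global_stats(docs, "iron")
for doc, score in ranking:
    print(f"{score:.4f}  {doc}")
```

With a single idf shared by all documents, only the field length separates the hits, so the one-word tag "Iron" comes first and the three-word tag comes last.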
You can use this in development, but I don't think it's advisable in production, since it adds an extra round trip to the shards before the query runs. If you have enough data, the frequencies should be more or less the same across shards.
See: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-search-type.html