Consider the following query for Elasticsearch 5.6:
{
"size": 0,
"query": {
"match_all": {}
},
"rescore": [
{
"window_size": 10000,
"query": {
"rescore_query": {
"function_score": {
"boost_mode": "replace",
"script_score": {
"script": {
"source": "doc['topic_score'].value"
}
}
}
},
"query_weight": 0,
"rescore_query_weight": 1
}
}
],
"aggs": {
"distinct": {
"terms": {
"field": "identical_id",
"order": {
"top_score": "desc"
}
},
"aggs": {
"best_unique_result": {
"top_hits": {
"size": 1
}
},
"top_score": {
"max": {
"script": {
"inline": "_score"
}
}
}
}
}
}
}
This is a simplified version where the real query has a more complex main query and the rescore function is far more intensive.
Let me explain it's purpose first incase I'm about to spend a 1000 hours developing a pen that writes in space when a pencil would actually solve my problem. I'm performing a fast initial query, then rescoring the top results with a much more intensive function. From those results I want to show the top distinct values, i.e. no two results should have the same identical_id. If there's a better way to do this I'd also consider that an answer.
I expected a query like this would order results by the rescore query, group all the results that had the same identical_id and display the top hit for each such distinct group. I also assumed that since I'm ordering those term aggregation buckets by the max parent _score, they would be ordered to reflect the best result they contain as determined from the original rescore query.
The reality is that the term buckets are ordered by the maximum query score and not the rescore query score. Strangely the top hits within the buckets do seem to use the rescore.
Is there a better way to achieve the end result that I want, or some way I can fix this query to work the way I expect it too?
From documentation :
The query rescorer executes a second query only on the Top-K results returned by the query and post_filter phases. The number of docs which will be examined on each shard can be controlled by the window_size parameter, which defaults to 10.
As the rescore query kicks in after the post_filter phase, I assume the term aggregation buckets are already fixed.
I have no idea on how you can combine rescore and aggregations. Sorry :(
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With