Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How do I select the top term buckets based on a rescore function in Elasticsearch

Consider the following query for Elasticsearch 5.6:

{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "rescore": [
    {
      "window_size": 10000,
      "query": {
        "rescore_query": {
          "function_score": {
            "boost_mode": "replace",
            "script_score": {
              "script": {
                "source": "doc['topic_score'].value"
              }
            }
          }
        },
        "query_weight": 0,
        "rescore_query_weight": 1
      }
    }
  ],
  "aggs": {
    "distinct": {
      "terms": {
        "field": "identical_id",
        "order": {
          "top_score": "desc"
        }
      },
      "aggs": {
        "best_unique_result": {
          "top_hits": {
            "size": 1
          }
        },
        "top_score": {
          "max": {
            "script": {
              "inline": "_score"
            }
          }
        }
      }
    }
  }
}

This is a simplified version where the real query has a more complex main query and the rescore function is far more intensive.

Let me explain it's purpose first incase I'm about to spend a 1000 hours developing a pen that writes in space when a pencil would actually solve my problem. I'm performing a fast initial query, then rescoring the top results with a much more intensive function. From those results I want to show the top distinct values, i.e. no two results should have the same identical_id. If there's a better way to do this I'd also consider that an answer.

I expected a query like this would order results by the rescore query, group all the results that had the same identical_id and display the top hit for each such distinct group. I also assumed that since I'm ordering those term aggregation buckets by the max parent _score, they would be ordered to reflect the best result they contain as determined from the original rescore query.

The reality is that the term buckets are ordered by the maximum query score and not the rescore query score. Strangely the top hits within the buckets do seem to use the rescore.

Is there a better way to achieve the end result that I want, or some way I can fix this query to work the way I expect it too?

like image 308
LaserJesus Avatar asked Dec 18 '25 15:12

LaserJesus


1 Answers

From documentation :

The query rescorer executes a second query only on the Top-K results returned by the query and post_filter phases. The number of docs which will be examined on each shard can be controlled by the window_size parameter, which defaults to 10.

As the rescore query kicks in after the post_filter phase, I assume the term aggregation buckets are already fixed.

I have no idea on how you can combine rescore and aggregations. Sorry :(

like image 63
Pierre Mallet Avatar answered Dec 20 '25 18:12

Pierre Mallet



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!