I'm trying to migrate some of the queries of our old MySQL database to our new Elasticsearch setup. The data is a little bit more complex but boils down to the following:
I've got an index containing a lot of scores. Each score represents the points a player scored in a particular game.
{
  "userId": 2,
  "scoreId": 3457,
  "game": {
    "id": 6,
    "name": "scrabble"
  },
  "date": 1340047100,
  "score": 56,
  // and more game data
}
scoreId is the unique id for this score; game.id is the id of that type of game.
{
  "userId": 6,
  "scoreId": 3479,
  "game": {
    "id": 5,
    "name": "risk"
  },
  "date": "1380067200",
  "score": 100,
  // and more game data
}
Over the years a lot of different games have been played, and I would like to rank the best players for each type of game. The ranking is based on the best 6 games of each player. So, for example, if a player played scrabble 10 times, only their 6 best scores count towards their total score.
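To pin down the rule, here is a minimal Python sketch of the ranking I'm after (sum of each player's 6 best scores for one game). The function name `rank_players` and the sample data are made up for illustration; only the field names come from the documents above.

```python
import heapq
from collections import defaultdict


def rank_players(scores, game_id, best_n=6):
    """Rank users by the sum of their best_n scores for one game."""
    per_user = defaultdict(list)
    for doc in scores:
        if doc["game"]["id"] == game_id:
            per_user[doc["userId"]].append(doc["score"])

    totals = {
        user: sum(heapq.nlargest(best_n, user_scores))
        for user, user_scores in per_user.items()
    }
    # Highest total first; ties are broken by userId for a stable order.
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))


# Made-up sample: user 2 played game 6 eight times, user 6 played it twice.
sample = (
    [{"userId": 2, "game": {"id": 6}, "score": s}
     for s in [10, 20, 30, 40, 50, 60, 70, 80]]
    + [{"userId": 6, "game": {"id": 6}, "score": s} for s in [100, 90]]
    + [{"userId": 6, "game": {"id": 5}, "score": 500}]  # other game, ignored
)

ranking = rank_players(sample, game_id=6)
print(ranking)
```

Only the 6 highest scores of user 2 are summed here; the two lowest ones are dropped.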
I would like to create a list like:
// Scrabble ranking:
# | user | total points
1 | 2 | 4500
2 | 6 | 3200
3 | 23 | 1500
The reason for the migration is that the old MySQL queries first get a list of all the distinct users for each game, and then execute another query for EACH user to get their best 6 scores. I hoped that I could use Elasticsearch aggregations to do it all in just one query, but so far I can't make it work.
The problem is that after a couple of hours of reading the elastic docs it seems that my problem is more complex than the examples. Maybe if someone can point me a bit in the right direction I can continue my search. At least this is not getting me anywhere:
GET /my-index/scores/_search
{
  "query": {
    "bool": {
      "filter": [
        {"term": { "game.id": 6 }}
      ]
    }
  },
  "aggs": {
    "scores": {
      "terms": {
        "field": "userId"
      }
    },
    "top_scores_user": {
      "top_hits": {
        "sort": [{
          "score": {
            "order": "desc"
          }
        }],
        "size" : 6
      }
    }
  },
  "size": 0
}
I'm using elastic 2.3 but there's a chance I could upgrade if it's really necessary.
Using top_hits will not let you achieve what you need, because you cannot act upon the fields that are returned for each document in the top hits aggregation.
One way to get around this is to use a top-level terms aggregation for users (as you did) and then, for each user, another terms sub-aggregation on the scores, sorted in decreasing order and limited to the 6 best ones. Finally, using a pipeline sum_bucket aggregation, you can sum up those 6 scores for each user.
POST /my-index/scores/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "game.id": 6
          }
        }
      ]
    }
  },
  "aggs": {
    "users": {
      "terms": {                  // segment by user
        "field": "userId"
      },
      "aggs": {
        "best_scores": {
          "terms": {              // 6 best scores for user
            "field": "score",
            "order": {
              "_term": "desc"
            },
            "size": 6
          },
          "aggs": {
            "total_score": {
              "sum": {
                "field": "score"
              }
            }
          }
        },
        "total_points": {         // total points for the user based on 6 best scores
          "sum_bucket": {
            "buckets_path": "best_scores>total_score"
          }
        }
      }
    }
  }
}
Note that one drawback of this solution is that the best_scores buckets are keyed by distinct score values. If a user has the exact same score twice among their best ones, both documents land in the same bucket, so you'll get the 7 best scores and not the 6 best ones, and the total_points value will be too high. We could use the avg instead of the sum metric aggregation, but if we do this, we'll ignore one of the score occurrences, which is not good either.
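To make the duplicate-score drawback concrete, here is a small Python simulation of what the buckets compute. It does not call Elasticsearch; the function names and the score list are hypothetical and only mimic the bucketing by distinct score value described above.

```python
import heapq
from collections import Counter


def true_top_n_total(scores, n=6):
    """What we actually want: the sum of the n best individual scores."""
    return sum(heapq.nlargest(n, scores))


def terms_sum_total(scores, n=6):
    """Mimics terms-on-score + sum: n best DISTINCT values, every occurrence summed."""
    counts = Counter(scores)
    top_values = sorted(counts, reverse=True)[:n]
    return sum(value * counts[value] for value in top_values)


def terms_avg_total(scores, n=6):
    """Mimics terms-on-score + avg: n best DISTINCT values, each counted once."""
    return sum(sorted(set(scores), reverse=True)[:n])


# Made-up scores for one user; 90 occurs twice.
user_scores = [100, 90, 90, 80, 70, 60, 50, 40]

print(true_top_n_total(user_scores))  # 100+90+90+80+70+60
print(terms_sum_total(user_scores))   # 100+180+80+70+60+50 (7 scores counted)
print(terms_avg_total(user_scores))   # 100+90+80+70+60+50 (one 90 lost)
```

With no duplicate scores among a user's best results, all three functions return the same total.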
Also note that it would be ideal to sort the users according to their total_points
value, but it is not possible to sort using pipeline aggregations (since they run after the reduce phase). The sorting will need to happen on the client side.
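The client-side sort could look roughly like the Python sketch below. It assumes the response has already been parsed into a dict and that the aggregation names match the query above (users, total_points). The helper name `rank_from_response` and the sample response are made up for illustration. Keep in mind that a terms aggregation returns only its top 10 buckets by default, so the users aggregation would need a larger size to cover every player.

```python
def rank_from_response(response):
    """Return (rank, userId, total_points) rows, best total first."""
    buckets = response["aggregations"]["users"]["buckets"]
    rows = [(b["key"], b["total_points"]["value"]) for b in buckets]
    rows.sort(key=lambda row: row[1], reverse=True)
    return [
        (rank, user, total)
        for rank, (user, total) in enumerate(rows, start=1)
    ]


# Hand-written sample shaped like the aggregation part of a search response.
sample_response = {
    "aggregations": {
        "users": {
            "buckets": [
                {"key": 6, "doc_count": 12, "total_points": {"value": 3200.0}},
                {"key": 23, "doc_count": 9, "total_points": {"value": 1500.0}},
                {"key": 2, "doc_count": 7, "total_points": {"value": 4500.0}},
            ]
        }
    }
}

for rank, user, total in rank_from_response(sample_response):
    print(rank, "|", user, "|", total)
```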