
Elasticsearch - Rank userIds based on score

I'm trying to migrate some of the queries of our old MySQL database to our new Elasticsearch setup. The data is a little bit more complex but boils down to the following:

I've got an index containing a lot of scores. Each score represents the points a player scored in a particular game.

{
  "userId": 2,
  "scoreId": 3457,
  "game": {
    "id": 6,
    "name": "scrabble"
  },
  "date": 1340047100,
  "score": 56,
  // and more game data
}

scoreId is the unique id of this score; game.id identifies the type of game.

{
  "userId": 6,
  "scoreId": 3479,
  "game": {
    "id": 5,
    "name": "risk"
  },
  "date": 1380067200,
  "score": 100,
  // and more game data
}

Over the years a lot of different games have been played, and I would like to rank the best players for each type of game. The ranking is based on the best 6 games of each player. So for example, if a player played scrabble 10 times, only their 6 best scores count towards their total score.

I would like to create a list like:

// Scrabble ranking:
# | user | total points
1 |  2   | 4500
2 |  6   | 3200
3 |  23  | 1500
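
To make the ranking rule concrete, here is a minimal in-memory sketch in Python of what the list should contain: per user, keep only the best 6 scores for one game, sum them, and order users by that sum. The function name rank_players and the sample records are made up for illustration; this is not the Elasticsearch query itself.

```python
import heapq
from collections import defaultdict


def rank_players(scores, game_id, best_n=6):
    """Rank users of one game by the sum of their best_n scores."""
    per_user = defaultdict(list)
    for doc in scores:
        if doc["game"]["id"] == game_id:
            per_user[doc["userId"]].append(doc["score"])

    # Only the best_n highest scores of each user count towards the total.
    totals = {
        user: sum(heapq.nlargest(best_n, values))
        for user, values in per_user.items()
    }
    ordered = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [(rank, user, total)
            for rank, (user, total) in enumerate(ordered, start=1)]


# Hypothetical sample data: user 2 played game 6 seven times, user 6 twice.
sample = (
    [{"userId": 2, "game": {"id": 6}, "score": s}
     for s in (10, 20, 30, 40, 50, 60, 70)]
    + [{"userId": 6, "game": {"id": 6}, "score": s} for s in (90, 80)]
    + [{"userId": 6, "game": {"id": 5}, "score": 100}]
)

ranking = rank_players(sample, game_id=6)
for rank, user, total in ranking:
    print(rank, user, total)
```

User 2's lowest score (10) is dropped because only 6 of the 7 games count, and the risk score of user 6 is ignored because it belongs to another game.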

The reason for the migration is that the old MySQL queries first get a list of all the distinct users for each game, and then execute another query for EACH user to get their best 6 scores. I hoped that I could use the aggregations of Elasticsearch to do it all in just one query, but so far I can't make it work.

The problem is that after a couple of hours of reading the elastic docs it seems that my problem is more complex than the examples. Maybe if someone can point me a bit in the right direction I can continue my search. At least this is not getting me anywhere:

GET /my-index/scores/_search
{
  "query": {
    "bool": {
      "filter": [
        {"term": { "game.id": 6 }}
      ]
    }
  },
  "aggs": {
    "scores": {
      "terms": {
        "field": "userId"
      }
    },
    "top_scores_user": {
      "top_hits": {
        "sort": [{
          "score": {
            "order": "desc"
          }
        }],
        "size" : 6
      }
    }
  },
  "size": 0
}

I'm using elastic 2.3 but there's a chance I could upgrade if it's really necessary.

Tieme asked May 03 '17 15:05



1 Answer

Using top_hits will not let you achieve what you need, because you cannot act upon the fields that are returned for each document in the top_hits aggregation (for example, to sum them).

One way to get around this is to use a top-level terms aggregation for users (as you did) and then, for each user, another terms sub-aggregation on the scores, sorted in decreasing order and limited to the 6 best ones. Finally, a sum_bucket pipeline aggregation sums up those 6 scores for each user.

POST /my-index/scores/_search    
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "game.id": 6
          }
        }
      ]
    }
  },
  "aggs": {
    "users": {
      "terms": {              <--- segment by user
        "field": "userId"
      },
      "aggs": {
        "best_scores": {
          "terms": {          <--- 6 best scores for user
            "field": "score",
            "order": {
              "_term": "desc"
            },
            "size": 6
          },
          "aggs": {
            "total_score": {
              "sum": {
                "field": "score"
              }
            }
          }
        },
        "total_points": {     <--- total points for the user based on 6 best scores
          "sum_bucket": {
            "buckets_path": "best_scores > total_score"
          }
        }
      }
    }
  }
}

Note that one drawback of this solution is that the best_scores buckets are keyed by distinct score value. If a user achieved exactly the same score twice, both occurrences land in the same bucket, so you can end up summing the 7 best scores instead of the 6 best ones, and the total_points value will be too high. We could use the avg metric instead of sum inside each bucket, but then only one occurrence of a repeated score would be counted, which is not good either.
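
The duplicate-score drawback can be reproduced outside Elasticsearch. The sketch below only mimics the bucketing logic in plain Python (distinct score values as buckets, top 6 values kept, every document in those buckets summed) and compares it with the exact best-6 total. The helper names and the score list are hypothetical.

```python
import heapq
from collections import Counter


def total_via_score_buckets(scores, size=6):
    """Mimic terms-on-score + sum + sum_bucket: one bucket per distinct
    score value, keep the `size` highest values, sum all docs in them."""
    counts = Counter(scores)
    top_values = sorted(counts, reverse=True)[:size]
    return sum(value * counts[value] for value in top_values)


def total_of_best(scores, size=6):
    """Exact rule from the question: sum of the `size` best scores."""
    return sum(heapq.nlargest(size, scores))


# Hypothetical user who scored 100 twice.
user_scores = [100, 100, 90, 80, 70, 60, 50]

bucketed = total_via_score_buckets(user_scores)
exact = total_of_best(user_scores)
print(bucketed, exact)
```

Here the bucketed total counts seven scores (both 100s plus 90, 80, 70, 60 and 50), while the exact rule drops the 50. When a user has no repeated scores among the top values, both functions agree.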

Also note that it would be ideal to sort the users according to their total_points value, but it is not possible to sort using pipeline aggregations (since they run after the reduce phase). The sorting will need to happen on the client side.
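
A minimal sketch of that client-side step in Python, assuming the response has already been parsed into a dict and uses the aggregation names from the query above (users, total_points). The function name rank_from_response and the sample response values are made up for illustration.

```python
def rank_from_response(response):
    """Sort the user buckets by total_points and attach a 1-based rank."""
    buckets = response["aggregations"]["users"]["buckets"]
    ordered = sorted(
        buckets,
        key=lambda bucket: bucket["total_points"]["value"],
        reverse=True,
    )
    return [
        (rank, bucket["key"], bucket["total_points"]["value"])
        for rank, bucket in enumerate(ordered, start=1)
    ]


# Hypothetical, trimmed-down response body (best_scores buckets omitted).
sample_response = {
    "aggregations": {
        "users": {
            "buckets": [
                {"key": 6, "doc_count": 8, "total_points": {"value": 3200.0}},
                {"key": 2, "doc_count": 7, "total_points": {"value": 4500.0}},
                {"key": 23, "doc_count": 3, "total_points": {"value": 1500.0}},
            ]
        }
    }
}

ranking = rank_from_response(sample_response)
for rank, user, total in ranking:
    print(rank, user, total)
```

Keep in mind that a terms aggregation returns only its 10 largest buckets by default, so the size of the users aggregation probably needs to be raised for the ranking to cover every player.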

Val answered Sep 29 '22 13:09