Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Huge Leaderboard ranking with filtering

We are building a massive multi-player educational game with some millions of entries in the leader-board (based on aggregated XPs gained). After a game finishes, we need to show the leaderboard and how this player/student is ranked. But there are a couple of filters for this leaderboard (global/by country, by month/year/today, by age etc) that can be mixed together e.g. 'Get me the leaderboard for my Country for the last month'. Number of combinations is ~20.

My problem is how to store such a structure that is updated regularly; recalculation of rankings must be done after each game. A typical full leaderboard at the moment has ~5 millions of entries for players coming from >150 countries.

  1. I used to have a MySQL Cluster Table (userid, xps, countryid) with 3 nodes, but ordering by XPs (either in DBMS or application which required all data from DB) proven to be too slow as numbers got bigger (>20K of users). This is an interesting post but again half a second for each query is too much.

  2. Then we used REDIS (see this post), but filtering is the problem here. We used separate lists for TOP 5 and the rest. TOP 5 was updated instantly, for the rest there was some delay of 20-30 minutes. We in fact ranked this user based on a cached instance of the Leaderboard (using the real XPs though, not the cached), so this was acceptable. Real-time on non-Top5 is not a prerequisite. This is fine for one global ranking, but how to filter the results based on month and/or country and/or age. Do we need to keep a list for every filtering combination?

  3. We also tested custom structures in Java (using it as a Java caching server similar in functionality with REDIS), still experimenting with it. Which is the best combination of structures to achieve our goal? We ended up using one list per filtering combination e.g. Map<FilteringCombination, SortedList<User>> and then doing binary search to the list of a specific key. This way, a finished game requires a couple of insertions say X, but it requires X*NumOfPlayers space, which is X times more than keeping a single list (not sure if this can fit to memory but we can always create a cluster here by splitting combinations to different servers). There is an issue here on how to rebuild the cache in case of failure, but that is another problem we can deal with.

  4. Extending the above method, we might slightly improve performance if we define scoring buckets inside each list (eg a bucket for 0-100xp, another for 101 - 1000xp, another for 1001 - 10000xp etc). The bucket splitting policy will be based on the players' xp distribution in our game. It's true that this distribution is dynamic in real world, but we have seen that after a few months changes are minor, having in mind that XPs are always increasing but new users are coming as well.

  5. We are also testing Cassandra's natural ordering by utilizing clustering keys and white-rows feature, although we know that having some millions of rows may not be easy to handle.

All in all, that is what we need to achieve. If a user (let's name her UserX) is not included in the Top5 list, we need to show this user's ranking together with some surrounding players (eg 2 above and 2 below) as the example below:

    Global TOP 5        My Global Ranking (425)   My Country Ranking     Other Rankings      
1. karen (12000xp)          423. george              1. david    
2. greg (11280xp)           424. nancy               2. donald 
3. philips (10293xp)      **425. UserX**             3. susan
4. jason (9800xp)           426. rebecca           **4. UserX** 
5. barbara (8000xp)         427. james               5. teresa

I've studied many SO or other posts, but still cannot find a solution for efficiently updating and filtering large Leaderboard tables. Which one candidate solution would you choose and what are the possible performance improvements (space + memory + (Insertion/Searching CPU cost))?

like image 984
Kostas Chalkias Avatar asked Dec 30 '14 13:12

Kostas Chalkias


People also ask

How would you design a real time leaderboard for a million users?

The general idea will be as follows: Extract all the information required to build the Leaderboard from your Points table in MySQL. What we will extract here is the total sum of points for each of the Users in the database. The result set will the IDs of the users together with the total Points they have earned.

How do you create a leaderboard?

Design A Leaderboard in C++addScore(playerId, score) − This will update the leaderboard by adding score to the given player's score. When there is no such player with given id in the leaderboard, add him to the leaderboard with the given score. top(K) − This will return the score sum of the top K players.

Which service would you use if want to store leaderboards of gaming applications efficiently?

Redis offers one highly efficient and scalable solution. Redis is an in-memory data store that supports basic key-value functionality. It also supports a variety of data structures such as hashes, lists, sets, sorted sets, range queries, and geospatial indexes.

How is leaderboard rank determined?

Score by points, 1st place received 100 points, 2nd receives 95 points, etc. The overall competitors with the highest point wins. The point breakdown can be found on the event workout page. Cumulative score, sums up your raw workout scores to determine your overall rank.


2 Answers

That's a very interesting problem - thanks for posting. In general databases excel at this type of problem in which there is large amounts of data that needs to be filtered and searched. My first guess is that you are not using MySQL indexes correctly. Having said that you clearly need to regularly find the nth row in an ordered list which is something that SQL is not at all good at.

If you are looking to some form of in-memory database then you'll need something more sophisticated than REDIS. I would suggest you look at VoltDB which is very fast but not cheap.

If you would like to build your own in-memory store then you'll need to calculate memory use to see if it's feasible. You will need an index (discussed later in this answer) for each row you want to search or filter on along with the record for each user. However even for 10 million rows and 20 fields its still going to be less than 1Gb RAM which should be fine on modern computers.

Now for the data structures. I believe you are on the right track using maps to lists. I don't think the lists need to be sorted - you just need to be able to get the set of users for particular value. In fact sets may be more appropriate (again worth testing performance). Here is my suggestion to try (I've just added country and age fields - I assume you'll need others but it's a reasonable example to start with):

enum Country {
    ...
}

class User {
    String givenName;
    String familyName;
    int xp;
    Country country;
    int age;
}

class LeaderBoard {
    Set<User> users;
    Map<Integer, Set<User>> xpIndex;
    Map<Country, Set<User>> countryIndex;
    Map<Integer, Set<User>> ageIndex;
}

Each of the indices will need to be updated when a field changes. For example:

private setUserAge(User user, int age) {
    assert users.contains(user);
    assert ageIndex.get(user.getAge()).contains(user);
    ageIndex.get(user.getAge()).remove(user);
    if (!ageIndex.containsKey(age)) {
        ageIndex.put(age, new TreeSet<>());
    }
    ageIndex.get(age).add(user);
    user.setAge(age);
}

Getting all users, by rank, that satisfy a given combination can be done in a number of ways:

countryIndex.get(Country.Germany).stream()
    .filter(ageIndex.get(20)::contains)
    .sorted(User::compareRank)
    ...

or

SortedSet<User> germanUsers = new TreeSet<>(User::compareRank);
germanUsers.addAll(countryIndex.get(Country.Germany));
germanUsers.retainAll(ageIndex.get(20));

You'll need to check which of these is more efficient - I would guess the stream implementation will be. Also it can be easily converted to a paralellStream.

You mention a concern with update efficiency. I would be very surprised if this was an issue unless there were many updates a second. In general with these types of applications you will get many more reads than writes.

I see no reason to manually partition the indexes as you are suggesting unless you are going to have hundreds of millions of entries. Better would be to experiment with HashMap vs TreeMap for the concrete instantiation of the indices.

The next obvious enhancement if you need better performance is to multithread the application. That should not be too complex as you have relatively simple data structures to synchronize. Use of parallel streams in the searches helps of course (and you get them for free in Java 8).

So my recommendation is to go with these simple data structures and eek out performance using multithreading and adjusting the concrete implementations (e.g. hash functions) before trying anything more sophisticated.

like image 86
sprinter Avatar answered Oct 22 '22 20:10

sprinter


Although I am still in the middle of benchmarks, I am updating the status of the current development. Best performance rates come when using:

Map<Country, Map<Age, Map <TimingIdentifier, List<User>>>> (List is sorted)

Some notes on the keys: I added a Country called World in order to have an instance of the full leader-board country-independent (as if the Country filter is not selected). I did the same for Age (All-Ages) and TimeIdentifier (All-Time). TimeIdentifier key values are [All-Time, Month, Week, Day]

The above can be extended for other filters, so it can be applied for other scenarios as well. Map<Filter1,Map<Filter2,Map<Filter3,Map<Filter4 ..other Map Keys here..,List<User>>>>

Update: Instead of using multiple Map wrappers, a class used as a key in a single Map with the above fields is slightly faster. Of course, we need a multiton like pattern to create all available FilterCombination objects:

class FilterCombination {
    private int CountryId;
    private int AgeId;
    private int TimeId;
    ...
}

then we define the Map<FilterCombination, List<User>> (sorted List)

I could use a TreeSet but I didn't. Why? Basically, I was looking for an Order Statistic Tree (see here), but it seems there are not official Java implementations (see here). Probably this is the way to go VS sorted List due to inefficiency of List.add(index, Object) which is O(n). A LinkedList would be better for .add(index, Object) but unfortunately it is slow in getting the k-th element (ranking is O(n)). So, every structure has its pros and against for such a task.

At the moment, I ended up using a sorted List. The reason is that when adding an element to the sorted list, I use a slightly modified binary search algorithm (see here). The above method gives me current User's rank at the insertion phase (so no additional search query is required), it is O(logn + n) (binary searching index + List.add(index, Object)).

Is there any other structure that performs better that O(logn + n) for insert + get rank together?

*Of course if I need to ask for User's ranking at a later time, I will again do a binary search, based on User's XP (+ timestamp as you see below) and not Id, because now I cannot search via User-Id in a List).

**As a comparator I use the following criteria

1st: XP points

in case of a draw - 2nd criterion: timestamp of last XP update

so, it is highly possible that equalities in Sorted list will be very very few. And even more, I would't mind if two users with the same XP are ranked in reverse order (even with our sample data of some millions of games, I found very few ties, not including zero XPs for which I don't care at all).

An XP update requires some work and resources. Fortunately, the second comparison criteria improved significantly User search inside this List (binary search again), because, before updating User's XPs, I had to remove the previous entries for this User in the lists... but I am looking via her previous XPs and timestamps so it is log(n).

like image 23
Kostas Chalkias Avatar answered Oct 22 '22 21:10

Kostas Chalkias