We are building a massive multi-player educational game with some millions of entries in the leader-board (based on aggregated XPs gained). After a game finishes, we need to show the leaderboard and how this player/student is ranked. But there are a couple of filters for this leaderboard (global/by country, by month/year/today, by age etc) that can be mixed together e.g. 'Get me the leaderboard <code>for my Country</code> <code>for the last month</code>'. Number of combinations is ~20. My problem is how to store such a structure that is updated regularly; recalculation of rankings must be done after each game. A typical full leaderboard at the moment has ~5 millions of entries for players coming from >150 countries. <ol> <li>I used to have a MySQL Cluster Table (userid, xps, countryid) with 3 nodes, but ordering by XPs (either in DBMS or application which required all data from DB) proven to be too slow as numbers got bigger (>20K of users). This is an interesting post but again half a second for each query is too much.</li> <li>Then we used REDIS (see this post), but filtering is the problem here. We used separate lists for TOP 5 and the rest. TOP 5 was updated instantly, for the rest there was some delay of 20-30 minutes. We in fact ranked this user based on a cached instance of the Leaderboard (using the real XPs though, not the cached), so this was acceptable. Real-time on non-Top5 is not a prerequisite. This is fine for one global ranking, but how to filter the results based on month and/or country and/or age. Do we need to keep a list for every filtering combination?</li> <li>We also tested custom structures in Java (using it as a Java caching server similar in functionality with REDIS), still experimenting with it. Which is the best combination of structures to achieve our goal? We ended up using one list per filtering combination e.g. <code>Map<FilteringCombination, SortedList<User>></code> and then doing binary search to the list of a specific key. This way, a finished game requires a couple of insertions say X, but it requires X*NumOfPlayers space, which is X times more than keeping a single list (not sure if this can fit to memory but we can always create a cluster here by splitting combinations to different servers). There is an issue here on how to rebuild the cache in case of failure, but that is another problem we can deal with.</li> <li>Extending the above method, we might slightly improve performance if we define scoring buckets inside each list (eg a bucket for 0-100xp, another for 101 - 1000xp, another for 1001 - 10000xp etc). The bucket splitting policy will be based on the players' xp distribution in our game. It's true that this distribution is dynamic in real world, but we have seen that after a few months changes are minor, having in mind that XPs are always increasing but new users are coming as well. </li> <li>We are also testing Cassandra's natural ordering by utilizing clustering keys and white-rows feature, although we know that having some millions of rows may not be easy to handle.</li> </ol> All in all, that is what we need to achieve. If a user (let's name her UserX) is not included in the Top5 list, we need to show this user's ranking together with some surrounding players (eg 2 above and 2 below) as the example below: <pre class="prettyprint"><code> Global TOP 5 My Global Ranking (425) My Country Ranking Other Rankings 1. karen (12000xp) 423. george 1. david 2. greg (11280xp) 424. nancy 2. donald 3. philips (10293xp) **425. UserX** 3. susan 4. jason (9800xp) 426. rebecca **4. UserX** 5. barbara (8000xp) 427. james 5. teresa </code></pre> I've studied many SO or other posts, but still cannot find a solution for efficiently updating and filtering large Leaderboard tables. Which one candidate solution would you choose and what are the possible performance improvements (space + memory + (Insertion/Searching CPU cost))?

Although I am still in the middle of benchmarks, I am updating the status of the current development. Best performance rates come when using: <code>Map<Country, Map<Age, Map <TimingIdentifier, List<User>>>></code> (List is sorted) Some notes on the keys: I added a Country called World in order to have an instance of the full leader-board country-independent (as if the Country filter is not selected). I did the same for Age (All-Ages) and TimeIdentifier (All-Time). TimeIdentifier key values are [All-Time, Month, Week, Day] The above can be extended for other filters, so it can be applied for other scenarios as well. <code>Map<Filter1,Map<Filter2,Map<Filter3,Map<Filter4 ..other Map Keys here..,List<User>>>></code> Update: Instead of using multiple Map wrappers, a class used as a key in a single Map with the above fields is slightly faster. Of course, we need a multiton like pattern to create all available FilterCombination objects: <pre class="prettyprint"><code>class FilterCombination { private int CountryId; private int AgeId; private int TimeId; ... } </code></pre> then we define the <code>Map<FilterCombination, List<User>></code> (sorted List) I could use a TreeSet but I didn't. Why? Basically, I was looking for an Order Statistic Tree (see here), but it seems there are not official Java implementations (see here). Probably this is the way to go VS sorted List due to inefficiency of <code>List.add(index, Object)</code> which is O(n). A LinkedList would be better for <code>.add(index, Object)</code> but unfortunately it is slow in getting the k-th element (ranking is O(n)). So, every structure has its pros and against for such a task. At the moment, I ended up using a sorted List. The reason is that when adding an element to the sorted list, I use a slightly modified binary search algorithm (see here). The above method gives me current User's rank at the insertion phase (so no additional search query is required), it is O(logn + n) (binary searching index + List.add(index, Object)). Is there any other structure that performs better that O(logn + n) for insert + get rank together? *Of course if I need to ask for User's ranking at a later time, I will again do a binary search, based on User's XP (+ timestamp as you see below) and not Id, because now I cannot search via User-Id in a List). **As a comparator I use the following criteria 1st: XP points in case of a draw - 2nd criterion: timestamp of last XP update so, it is highly possible that equalities in Sorted list will be very very few. And even more, I would't mind if two users with the same XP are ranked in reverse order (even with our sample data of some millions of games, I found very few ties, not including zero XPs for which I don't care at all). An XP update requires some work and resources. Fortunately, the second comparison criteria improved significantly User search inside this List (binary search again), because, before updating User's XPs, I had to remove the previous entries for this User in the lists... but I am looking via her previous XPs and timestamps so it is log(n).

Huge Leaderboard ranking with filtering

Tags:

java

database

caching

redis

cassandra

We are building a massive multi-player educational game with some millions of entries in the leader-board (based on aggregated XPs gained). After a game finishes, we need to show the leaderboard and how this player/student is ranked. But there are a couple of filters for this leaderboard (global/by country, by month/year/today, by age etc) that can be mixed together e.g. 'Get me the leaderboard for my Country for the last month'. Number of combinations is ~20.

My problem is how to store such a structure that is updated regularly; recalculation of rankings must be done after each game. A typical full leaderboard at the moment has ~5 millions of entries for players coming from >150 countries.

I used to have a MySQL Cluster Table (userid, xps, countryid) with 3 nodes, but ordering by XPs (either in DBMS or application which required all data from DB) proven to be too slow as numbers got bigger (>20K of users). This is an interesting post but again half a second for each query is too much.
Then we used REDIS (see this post), but filtering is the problem here. We used separate lists for TOP 5 and the rest. TOP 5 was updated instantly, for the rest there was some delay of 20-30 minutes. We in fact ranked this user based on a cached instance of the Leaderboard (using the real XPs though, not the cached), so this was acceptable. Real-time on non-Top5 is not a prerequisite. This is fine for one global ranking, but how to filter the results based on month and/or country and/or age. Do we need to keep a list for every filtering combination?
We also tested custom structures in Java (using it as a Java caching server similar in functionality with REDIS), still experimenting with it. Which is the best combination of structures to achieve our goal? We ended up using one list per filtering combination e.g. Map<FilteringCombination, SortedList<User>> and then doing binary search to the list of a specific key. This way, a finished game requires a couple of insertions say X, but it requires X*NumOfPlayers space, which is X times more than keeping a single list (not sure if this can fit to memory but we can always create a cluster here by splitting combinations to different servers). There is an issue here on how to rebuild the cache in case of failure, but that is another problem we can deal with.
Extending the above method, we might slightly improve performance if we define scoring buckets inside each list (eg a bucket for 0-100xp, another for 101 - 1000xp, another for 1001 - 10000xp etc). The bucket splitting policy will be based on the players' xp distribution in our game. It's true that this distribution is dynamic in real world, but we have seen that after a few months changes are minor, having in mind that XPs are always increasing but new users are coming as well.
We are also testing Cassandra's natural ordering by utilizing clustering keys and white-rows feature, although we know that having some millions of rows may not be easy to handle.

All in all, that is what we need to achieve. If a user (let's name her UserX) is not included in the Top5 list, we need to show this user's ranking together with some surrounding players (eg 2 above and 2 below) as the example below:

    Global TOP 5        My Global Ranking (425)   My Country Ranking     Other Rankings      
1. karen (12000xp)          423. george              1. david    
2. greg (11280xp)           424. nancy               2. donald 
3. philips (10293xp)      **425. UserX**             3. susan
4. jason (9800xp)           426. rebecca           **4. UserX** 
5. barbara (8000xp)         427. james               5. teresa

I've studied many SO or other posts, but still cannot find a solution for efficiently updating and filtering large Leaderboard tables. Which one candidate solution would you choose and what are the possible performance improvements (space + memory + (Insertion/Searching CPU cost))?

984

asked Dec 30 '14 13:12

Kostas Chalkias

2 Answers

That's a very interesting problem - thanks for posting. In general databases excel at this type of problem in which there is large amounts of data that needs to be filtered and searched. My first guess is that you are not using MySQL indexes correctly. Having said that you clearly need to regularly find the nth row in an ordered list which is something that SQL is not at all good at.

If you are looking to some form of in-memory database then you'll need something more sophisticated than REDIS. I would suggest you look at VoltDB which is very fast but not cheap.

If you would like to build your own in-memory store then you'll need to calculate memory use to see if it's feasible. You will need an index (discussed later in this answer) for each row you want to search or filter on along with the record for each user. However even for 10 million rows and 20 fields its still going to be less than 1Gb RAM which should be fine on modern computers.

Now for the data structures. I believe you are on the right track using maps to lists. I don't think the lists need to be sorted - you just need to be able to get the set of users for particular value. In fact sets may be more appropriate (again worth testing performance). Here is my suggestion to try (I've just added country and age fields - I assume you'll need others but it's a reasonable example to start with):

enum Country {
    ...
}

class User {
    String givenName;
    String familyName;
    int xp;
    Country country;
    int age;
}

class LeaderBoard {
    Set<User> users;
    Map<Integer, Set<User>> xpIndex;
    Map<Country, Set<User>> countryIndex;
    Map<Integer, Set<User>> ageIndex;
}

Each of the indices will need to be updated when a field changes. For example:

private setUserAge(User user, int age) {
    assert users.contains(user);
    assert ageIndex.get(user.getAge()).contains(user);
    ageIndex.get(user.getAge()).remove(user);
    if (!ageIndex.containsKey(age)) {
        ageIndex.put(age, new TreeSet<>());
    }
    ageIndex.get(age).add(user);
    user.setAge(age);
}

Getting all users, by rank, that satisfy a given combination can be done in a number of ways:

countryIndex.get(Country.Germany).stream()
    .filter(ageIndex.get(20)::contains)
    .sorted(User::compareRank)
    ...

SortedSet<User> germanUsers = new TreeSet<>(User::compareRank);
germanUsers.addAll(countryIndex.get(Country.Germany));
germanUsers.retainAll(ageIndex.get(20));

You'll need to check which of these is more efficient - I would guess the stream implementation will be. Also it can be easily converted to a paralellStream.

You mention a concern with update efficiency. I would be very surprised if this was an issue unless there were many updates a second. In general with these types of applications you will get many more reads than writes.

I see no reason to manually partition the indexes as you are suggesting unless you are going to have hundreds of millions of entries. Better would be to experiment with HashMap vs TreeMap for the concrete instantiation of the indices.

The next obvious enhancement if you need better performance is to multithread the application. That should not be too complex as you have relatively simple data structures to synchronize. Use of parallel streams in the searches helps of course (and you get them for free in Java 8).

So my recommendation is to go with these simple data structures and eek out performance using multithreading and adjusting the concrete implementations (e.g. hash functions) before trying anything more sophisticated.

answered Oct 22 '22 20:10

sprinter

Although I am still in the middle of benchmarks, I am updating the status of the current development. Best performance rates come when using:

Map<Country, Map<Age, Map <TimingIdentifier, List<User>>>> (List is sorted)

Some notes on the keys: I added a Country called World in order to have an instance of the full leader-board country-independent (as if the Country filter is not selected). I did the same for Age (All-Ages) and TimeIdentifier (All-Time). TimeIdentifier key values are [All-Time, Month, Week, Day]

The above can be extended for other filters, so it can be applied for other scenarios as well. Map<Filter1,Map<Filter2,Map<Filter3,Map<Filter4 ..other Map Keys here..,List<User>>>>

Update: Instead of using multiple Map wrappers, a class used as a key in a single Map with the above fields is slightly faster. Of course, we need a multiton like pattern to create all available FilterCombination objects:

class FilterCombination {
    private int CountryId;
    private int AgeId;
    private int TimeId;
    ...
}

then we define the Map<FilterCombination, List<User>> (sorted List)

I could use a TreeSet but I didn't. Why? Basically, I was looking for an Order Statistic Tree (see here), but it seems there are not official Java implementations (see here). Probably this is the way to go VS sorted List due to inefficiency of List.add(index, Object) which is O(n). A LinkedList would be better for .add(index, Object) but unfortunately it is slow in getting the k-th element (ranking is O(n)). So, every structure has its pros and against for such a task.

At the moment, I ended up using a sorted List. The reason is that when adding an element to the sorted list, I use a slightly modified binary search algorithm (see here). The above method gives me current User's rank at the insertion phase (so no additional search query is required), it is O(logn + n) (binary searching index + List.add(index, Object)).

Is there any other structure that performs better that O(logn + n) for insert + get rank together?

*Of course if I need to ask for User's ranking at a later time, I will again do a binary search, based on User's XP (+ timestamp as you see below) and not Id, because now I cannot search via User-Id in a List).

**As a comparator I use the following criteria

1st: XP points

in case of a draw - 2nd criterion: timestamp of last XP update

so, it is highly possible that equalities in Sorted list will be very very few. And even more, I would't mind if two users with the same XP are ranked in reverse order (even with our sample data of some millions of games, I found very few ties, not including zero XPs for which I don't care at all).

An XP update requires some work and resources. Fortunately, the second comparison criteria improved significantly User search inside this List (binary search again), because, before updating User's XPs, I had to remove the previous entries for this User in the lists... but I am looking via her previous XPs and timestamps so it is log(n).

answered Oct 22 '22 21:10

Kostas Chalkias

Related questions
                            
                                Apache MINA SFTP - limit the directory structure that the user sees
                            
                                How to create a file like the MNIST dataset
                            
                                java.lang.StackOverflowError when building Sencha/ ExtJS 5 project
                            
                                Java, Polymorphism, Static Typing and the "QueryInterface" pattern
                            
                                PDF PrinterJob get Job status
                            
                                How to hide password from Post request url and browser dump
                            
                                How to write mockito matcher for byte[]?
                            
                                Efficiently read file from URL into byte[] in Java
                            
                                Heuristic for A*-Algorithm with irregular distances between nodes
                            
                                Printing a 1800 x 1200 image on 4 x 6 paper using Java
                            
                                maven can't find path specified
                            
                                Traverse JSON data in JAVA
                            
                                Tomcat need to restart after database restart
                            
                                Android Geofencing with Google Play Services 6.5.87
                            
                                ListView with Buttons and static String Array
                            
                                How combine two strings with random position android
                            
                                How to override the properties file value through command line arguments?
                            
                                Scala Enumeration ValueSet.isEmpty slow
                            
                                Java type inference with lower bounded types
                            
                                Retrieving current capslock state in Java [duplicate]

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Huge Leaderboard ranking with filtering

Tags:

java

database

caching

redis

cassandra

Kostas Chalkias

People also ask

2 Answers

sprinter

Kostas Chalkias

Recent Activity

Donate For Us