
Upsert performance decreases with a growing collection (number of documents)

Use Case:

I'm consuming a REST API which provides battle results of a video game. It is a team vs team online game, and each team consists of 3 players who can each pick one of 100 different characters. I want to count the number of wins, losses and draws for each team combination. I receive roughly 1000 battle results per second. I concatenate the character ids of each team in ascending order and then save the wins, losses and draws for each combination.
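A minimal sketch of how such a team key could be built (the buildTeamDeck helper and the separator are assumptions for illustration, not the question's actual code):

// Assumed helper, for illustration only: sort the three character ids
// ascending and join them into one key, e.g. [27, 3, 85] -> "3-27-85".
function buildTeamDeck(characterIds: number[]): string {
  return [...characterIds].sort((a, b) => a - b).join('-');
}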

My current implementation:

import { Schema } from 'mongoose';

const combinationStatsSchema: Schema = new Schema({
  combination: { type: String, required: true, index: true },
  gameType: { type: String, required: true, index: true },
  wins: { type: Number, default: 0 },
  draws: { type: Number, default: 0 },
  losses: { type: Number, default: 0 },
  totalGames: { type: Number, default: 0, index: true },
  battleDate: { type: Date, index: true, required: true }
});

For each returned log I perform an upsert and send these queries in bulk (5-30 rows) to MongoDB:

const filter: any = { combination: log.teamDeck, gameType, battleDate };
if (battleType === BattleType.PvP) {
  filter.arenaId = log.arena.id;
}
// shorthand for { draws: draws, losses: losses, wins: wins }; each holds the
// increment for this single battle result
const update = { $inc: { draws, losses, wins, totalGames: 1 } };
combiStatsBulk.find(filter).upsert().updateOne(update);
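For context, a sketch of how such a bulk might be created and flushed per batch (assuming the Mongoose model is called CombinationStats and using the driver's initializeUnorderedBulkOp/execute API; not the question's exact code):

// Sketch: one polled batch of 5-30 battle results in `battleLogs`.
const combiStatsBulk = CombinationStats.collection.initializeUnorderedBulkOp();

for (const log of battleLogs) {
  const filter: any = { combination: log.teamDeck, gameType, battleDate };
  if (battleType === BattleType.PvP) {
    filter.arenaId = log.arena.id;
  }
  combiStatsBulk.find(filter).upsert().updateOne({
    $inc: { draws, losses, wins, totalGames: 1 }
  });
}

await combiStatsBulk.execute(); // one round trip for the whole batch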

My problem:

As long as I have just a few thousand entries in my combinationStats collection, MongoDB uses only 0-2% CPU. Once the collection holds a couple million documents (which happens pretty quickly due to the number of possible combinations), MongoDB constantly uses 50-100% CPU. Apparently my approach does not scale at all.

My question:

Either of these options could solve the problem described above:

  1. Can I optimize my MongoDB solution described above so that it doesn't consume that much CPU? (I already index the fields I filter on, and I perform the upserts in bulk.) Would it help to create a hash (based on all my filter fields) and filter on that instead to improve performance?
  2. Is there a better database / technology suited to aggregating such data? I can imagine a couple more use cases where I need to increment a counter for a given identifier.

Edit: After khang commented that it might be related to upsert performance, I replaced my $inc with a $set, and the performance was indeed equally "poor". I then tried the suggested find() followed by a manual update() approach, but the results were no better.

asked Dec 28 '17 by kentor

2 Answers

Create a hash on your filter conditions:

I was able to reduce the CPU usage from 80-90% down to 1-5% and saw a higher throughput.

Apparently the filter was the problem. Instead of filtering on these three conditions: { combination: log.teamDeck, gameType, battleDate }, I created a 128-bit hash in my Node application. I use this hash for upserting and set combination, gameType and battleDate as additional fields in my update document.

For creating the hash I used the metrohash library, which can be found here: https://github.com/jandrewrogers/MetroHash. Unfortunately I cannot explain why the performance is so much better, especially since I had indexed all of my previous filter conditions.
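A sketch of the idea, using Node's built-in crypto module to produce a 128-bit digest as a stand-in (the answer used metrohash, whose exact API is not reproduced here; the hash field is assumed to have its own index):

import { createHash } from 'crypto';

// Stand-in for metrohash: any stable 128-bit digest of the filter fields works.
// md5 yields a 128-bit digest; metrohash is a faster non-cryptographic choice.
function combinationHash(teamDeck: string, gameType: string, battleDate: Date): string {
  return createHash('md5')
    .update(`${teamDeck}|${gameType}|${battleDate.toISOString()}`)
    .digest('hex');
}

// Upsert on the single indexed hash field; keep the original fields as plain data.
const hash = combinationHash(log.teamDeck, gameType, battleDate);
combiStatsBulk.find({ hash }).upsert().updateOne({
  $inc: { draws, losses, wins, totalGames: 1 },
  $set: { combination: log.teamDeck, gameType, battleDate }
});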

answered Oct 17 '22 by kentor


In (1.) you assert that you perform upserts in bulk. But based on how this seems to scale, you're probably sending too few rows into each batch. Consider doubling the batch size each time there's a doubling of stored rows. Please do post mongo's explain() query plan for your setup.
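For reference, a query plan for the upsert's filter can be obtained like this (a sketch using Mongoose's explain(); the field values are placeholders, not data from the question):

// Sketch: inspect the plan MongoDB chooses for the upsert's filter.
const plan = await CombinationStats
  .find({ combination: '3-27-85', gameType: 'PvP', battleDate: new Date('2017-12-28') })
  .explain('executionStats');
console.log(JSON.stringify(plan, null, 2)); // check winningPlan and totalKeysExamined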

In (2.) you consider switching to, say, MySQL or Postgres. Yes, that would absolutely be a valid experiment. Again, be sure to post EXPLAIN output alongside your timing data.

There are only about a million possible team compositions, and there's a distribution over those, with some being much more popular than others. You only need to maintain a million counters, which is not such a large number. However, doing 1e6 disk I/Os can take a while, especially if they are random reads. Consider moving away from a disk-resident data structure, to which you might be doing frequent COMMITs, and switching to a memory-resident hash or b-tree. It doesn't sound like ACID-style persistence guarantees are important to your application.

Also, once you have assembled "large" input batches, certainly more than a thousand and perhaps on the order of a million, do take care to sort the batch before processing. Then your counter-maintenance problem just looks like a merge sort, either in internal memory or on external storage.

One principled approach to scaling your batches is to accumulate observations in some conveniently sized sorted memory buffer, and only release aggregate (counted) observations from that pipeline stage when the number of distinct team compositions in the buffer is above some threshold K. Mongo (or whatever you use) would be the next stage in your pipeline. If K is much more than 1% of 1e6, then even a sequential scan of counters stored on disk would have a decent chance of finding useful update work to do for each disk block read.
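A sketch of such a pre-aggregation stage (the CounterBuffer name and the flush logic are assumptions, not part of the answer; the drained rows would feed one bulk of $inc upserts):

type Counts = { wins: number; losses: number; draws: number; totalGames: number };

// Sketch: accumulate counts per team composition in memory and flush to the
// next pipeline stage (e.g. MongoDB) only once more than K distinct
// compositions have been seen.
class CounterBuffer {
  private buffer = new Map<string, Counts>();

  constructor(private readonly K: number) {}

  add(key: string, delta: Counts): void {
    const c = this.buffer.get(key) ?? { wins: 0, losses: 0, draws: 0, totalGames: 0 };
    c.wins += delta.wins;
    c.losses += delta.losses;
    c.draws += delta.draws;
    c.totalGames += delta.totalGames;
    this.buffer.set(key, c);
  }

  // Returns the aggregated rows in sorted key order once the threshold is
  // reached (so downstream updates touch the index sequentially), else null.
  drainIfFull(): Array<[string, Counts]> | null {
    if (this.buffer.size <= this.K) return null;
    const rows = [...this.buffer.entries()].sort(([a], [b]) => a.localeCompare(b));
    this.buffer.clear();
    return rows;
  }
}

Sorting before the flush keeps the downstream upserts in key order, which is the merge-sort-like access pattern described above.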

answered Oct 17 '22 by J_H