MongoDB Update-Upsert Performance Barrier (Performance falls off a cliff)

I'm performing a repetitive update operation to add documents into my MongoDB as part of some performance evaluation. I've discovered a huge non-linearity in execution time based on the number of updates (w/ upserts) I'm performing:

Looping with the following command in Python...

collection.update({'timestamp': x}, {'$set': {'value1': y, 'v1': y/2, 'v2': y/4}}, upsert=True)
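
For context, the full loop looks roughly like this (a minimal sketch; the client setup, collection name, value generation, and timing are assumptions, not part of the original code):

    import time
    from pymongo import MongoClient

    client = MongoClient()                       # assumes a local mongod on the default port
    collection = client["test"]["upsert_bench"]  # hypothetical database/collection names

    def run_upserts(n):
        collection.drop()                        # start each measurement from a blank collection
        start = time.time()
        for x in range(n):
            y = float(x)
            collection.update(
                {'timestamp': x},
                {'$set': {'value1': y, 'v1': y / 2, 'v2': y / 4}},
                upsert=True,
            )
        return time.time() - start

    for n in (500, 1000, 2000, 4000, 8000, 16000, 32000):
        print(n, "upserts:", round(run_upserts(n), 1), "seconds")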

Gives me these results...

500 document upserts: 2 seconds
1000 document upserts: 3 seconds
2000 document upserts: 3 seconds
4000 document upserts: 6 seconds
8000 document upserts: 14 seconds
16000 document upserts: 77 seconds
32000 document upserts: 280 seconds

Notice how after 8k document updates the performance starts to rapidly degrade, and by 32k document updates we're seeing a 6x reduction in throughput. Why is this? It seems strange that "manually" running 4k document updates 8 times in a row would be 6x faster than having Python perform them all consecutively.

I've seen that in mongostat I'm getting a ridiculously high locked db ratio (>100%), and top is showing >85% CPU usage while this is running. I've got an i7 processor with 4 cores available to the VM.

asked Dec 16 '22 by EnemyBagJones

1 Answer

You should put an ascending index on your "timestamp" field:

collection.ensure_index("timestamp")  # shorthand for single-key, ascending index

If this index should contain unique values:

collection.ensure_index("timestamp", unique=True)

Since the query spec ({'timestamp': x}) is not indexed and you are performing updates, the database has to check every document in the collection to see whether any document already matches that spec. When you do this for 500 documents (in a blank collection), the effects are not so bad... but when you do it for 32k, it does something like this (in the worst case):

document 1 - assuming blank collection, definitely gets inserted

document 2 - check document 1, update or insert occurs

document 3 - check documents 1-2, update or insert occurs

...etc...

document 32000 - check documents 1-31999, update or insert
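
In rough numbers that pattern is quadratic: n upserts against an unindexed field cost on the order of n*(n-1)/2 document checks in the worst case. A quick back-of-the-envelope calculation (a sketch, using the batch sizes from the question) shows why the cliff appears where it does:

    # Worst-case document checks for n unindexed upserts into a blank collection
    for n in (500, 4000, 8000, 16000, 32000):
        print(f"{n:>6} upserts -> ~{n * (n - 1) // 2:,} document checks")

    #    500 upserts -> ~124,750 document checks
    #   4000 upserts -> ~7,998,000 document checks
    #   8000 upserts -> ~31,996,000 document checks
    #  16000 upserts -> ~127,992,000 document checks
    #  32000 upserts -> ~511,984,000 document checks

Going from 4k to 32k upserts multiplies the work by roughly 64x, not 8x, which is consistent with the collapse in throughput you measured.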

When you add the index, the database no longer has to check every document in the collection; instead, it can use the index to find any possible matches much more quickly using a B-tree cursor instead of a basic cursor.

You should compare the results of collection.find({"timestamp": x}).explain() with and without the index (note that you may need to use the hint() method to force explain() to use the index). The critical comparison is how many documents were examined (the "nscanned" key in the explain() output) versus how many documents matched the query (the "n" key). If the database only has to scan what matches, or close to that, the query is very efficient; if it scans 32,000 documents to find one or a handful of matches, it is terribly inefficient, especially if it has to do that for each and every upsert.
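
Concretely, with the legacy explain() format that uses the keys named above (a sketch; the exact output depends on your server version):

    plan = collection.find({"timestamp": 1234}).explain()

    print(plan["cursor"])    # "BasicCursor" = full scan; "BtreeCursor timestamp_1" = index used
    print(plan["nscanned"])  # documents examined to answer the query
    print(plan["n"])         # documents that actually matched

Without the index, expect nscanned to be close to the collection size while n is 1 or a handful; with the index, nscanned should drop to roughly n.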

One wrinkle worth double-checking: since you have not set multi=True in your update call, an update operation that finds a matching document updates only that document and does not continue scanning the rest of the collection.
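
In other words, with the legacy update() call you are using (a sketch):

    # Default (multi=False): the scan stops at the first matching document.
    collection.update({'timestamp': x}, {'$set': {'value1': y}}, upsert=True)

    # multi=True would instead modify every matching document, scanning for all of them:
    collection.update({'timestamp': x}, {'$set': {'value1': y}}, multi=True)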

Sorry for the link spam, but these are all must-reads:

http://docs.mongodb.org/manual/core/indexes/

http://api.mongodb.org/python/current/api/pymongo/collection.html#pymongo.collection.Collection.ensure_index

http://api.mongodb.org/python/current/api/pymongo/collection.html#pymongo.collection.Collection.update

http://docs.mongodb.org/manual/reference/method/cursor.explain/

answered May 13 '23 by mdscruggs