I'm performing a repetitive update operation to add documents into my MongoDB as part of some performance evaluation. I've discovered a huge non-linearity in execution time based on the number of updates (w/ upserts) I'm performing:
Looping with the following command in Python...
collection.update({'timestamp': x}, {'$set': {'value1': y, 'v1': y/2, 'v2': y/4}}, upsert=True)
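The full timing loop is roughly the following sketch (the connection details, database/collection names, and the x and y values are placeholders for my real setup):
import time
from pymongo import MongoClient  # assumes pymongo 2.4+ with a local mongod on the default port

client = MongoClient()                  # hypothetical connection
collection = client['test']['samples']  # hypothetical database/collection names

n = 32000
start = time.time()
for i in range(n):
    x, y = i, float(i)                  # placeholder values standing in for the real data
    collection.update({'timestamp': x}, {'$set': {'value1': y, 'v1': y/2, 'v2': y/4}}, upsert=True)
print("%d upserts in %.1f seconds" % (n, time.time() - start))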
Gives me these results...
500 document upserts: 2 seconds
1000 document upserts: 3 seconds
2000 document upserts: 3 seconds
4000 document upserts: 6 seconds
8000 document upserts: 14 seconds
16000 document upserts: 77 seconds
32000 document upserts: 280 seconds
Notice how after 8k document updates the performance starts to rapidly degrade, and by 32k document updates we're seeing a 6x reduction in throughput. Why is this? It seems strange that "manually" running 4k document updates 8 times in a row would be 6x faster than having Python perform them all consecutively.
I've seen in mongostat that I'm getting a ridiculously high locked-db ratio (>100%), and top is showing >85% CPU usage while this runs. I've got an i7 processor with 4 cores available to the VM.
You should put an ascending index on your "timestamp" field:
collection.ensure_index("timestamp") # shorthand for single-key, ascending index
If this index should contain unique values:
collection.ensure_index("timestamp", unique=True)
Since the spec is not indexed and you are performing updates, the database has to check every document in the collection to see if any documents already exist with that spec. When you do this for 500 documents (in a blank collection), the effects are not so bad...but when you do it for 32k, it does something like this (in the worst case):
document 1 - assuming blank collection, definitely gets inserted
document 2 - check document 1, update or insert occurs
document 3 - check documents 1-2, update or insert occurs
...etc...
document 32000 - check documents 1-31999, update or insert
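As a back-of-the-envelope estimate, that worst case is on the order of n*(n-1)/2 document checks: roughly 512 million checks for a single 32k run, versus roughly 8 * (4000 * 3999 / 2) ≈ 64 million checks for eight separate 4k runs, which is in line with the slowdown you measured.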
When you add the index, the database no longer has to check every document in the collection; instead, it can use the index to find any possible matches much more quickly using a B-tree cursor instead of a basic cursor.
You should compare the results of collection.find({"timestamp": x}).explain() with and without the index (note you may need to use the hint() method to force it to use the index). The critical factor is how many documents the database has to iterate over (the "nscanned" value in the explain() output) versus how many documents actually match your query (the "n" value). Scanning only the documents that match, or close to that, is very efficient; scanning 32000 items to find one or a handful of matches is terribly inefficient, especially if the database has to do that for each and every upsert.
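A minimal way to check this from Python (assuming pymongo 2.x, where explain() returns these keys directly):
import pymongo

plan = collection.find({"timestamp": x}).explain()
print(plan.get("cursor"))    # "BasicCursor" without the index, "BtreeCursor timestamp_1" with it
print(plan.get("nscanned"))  # documents examined
print(plan.get("n"))         # documents matched

# force the index if the query optimizer does not pick it on its own:
plan = collection.find({"timestamp": x}).hint([("timestamp", pymongo.ASCENDING)]).explain()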
A notable wrinkle to double-check: since you have not set multi=True in your update call, the operation will update only the first matching document it finds and will not continue scanning the rest of the collection.
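Purely for illustration, an update that touches every matching document instead (not what you want here) would pass multi=True:
collection.update({'timestamp': x}, {'$set': {'value1': y}}, multi=True)  # updates all matches, not just the first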
Sorry for the link spam, but these are all must-reads:
http://docs.mongodb.org/manual/core/indexes/
http://api.mongodb.org/python/current/api/pymongo/collection.html#pymongo.collection.Collection.ensure_index
http://api.mongodb.org/python/current/api/pymongo/collection.html#pymongo.collection.Collection.update
http://docs.mongodb.org/manual/reference/method/cursor.explain/