 

Bulk update is too slow

I am using pymongo to do a bulk update.
The names list below is a distinct list of names (each name might have multiple documents in the collection)

Code 1:

bulk = db.collection.initialize_unordered_bulk_op()
for name in names:  
    bulk.find({"A":{"$exists":False},'Name':name}).update({"$set":{'B':b,'C':c,'D':d}})
print bulk.execute()  

Code 2:

bulk = db.collection.initialize_unordered_bulk_op()
counter = 0
for name in names:  
    bulk.find({"A":{"$exists":False},'Name':name}).update({"$set":{'B':b,'C':c,'D':d}})
    counter = counter + 1
    if (counter % 100 == 0):
        print bulk.execute()
        bulk = db.collection.initialize_unordered_bulk_op()
if (counter % 100 != 0):
    print bulk.execute()  

I have 50000 documents in my collection. If I get rid of the counter and if statement (Code 1), the code gets stuck! With the if statement (Code 2), I am assuming this operation shouldn't take more than a couple of minutes, but it is taking way more than that! Can you please help me make it faster, or am I wrong in my assumption?

asked Sep 23 '16 by amazingCodingExperience
1 Answer

You most likely forgot to add indexes to support your queries! This triggers a full collection scan for each of your operations, which is painfully slow (as you noticed).

The following code tests both update_many and the bulk approach, without and with indexes on the 'name' and 'A' fields. The numbers speak for themselves.

Note: I wasn't patient enough to run this for 50000 documents without indexes, so I used 10000. Results for 10000 documents are:

  • without index and update_many: 38.6 seconds
  • without index and bulk update: 28.7 seconds
  • with index and update_many: 3.9 seconds
  • with index and bulk update: 0.52 seconds

For 50000 documents with the indexes added it takes 2.67 seconds. I ran the test on a Windows machine, with MongoDB running in Docker on the same host.

For more information about indexes see https://docs.mongodb.com/manual/indexes/#indexes. In short: indexes are kept in RAM and allow fast querying and lookup of documents. Indexes have to be chosen specifically to match your queries.

from pymongo import MongoClient
import random
from timeit import timeit


col = MongoClient()['test']['test']

col.drop()  # erase all documents in collection 'test'
docs = []

# Initialize 10000 documents, using a random number between 0 and 1
# converted to a string as the name. For documents with a name > 0.5,
# add the key A.
for i in range(10000):
    number = random.random()
    if number > 0.5:
        doc = {'name': str(number), 'A': True}
    else:
        doc = {'name': str(number)}
    docs.append(doc)

col.insert_many(docs)  # insert all documents into the collection
names = col.distinct('name')  # all distinct values of the key 'name'


def update_with_update_many():
    for name in names:
        col.update_many({'A': {'$exists': False}, 'name': name},
                        {'$set': {'B': 1, 'C': 2, 'D': 3}})


def update_with_bulk():
    bulk = col.initialize_unordered_bulk_op()
    for name in names:
        bulk.find({'A': {'$exists': False}, 'name': name}).\
            update({'$set': {'B': 1, 'C': 2, 'D': 3}})
    bulk.execute()


print(timeit(update_with_update_many, number=1))
print(timeit(update_with_bulk, number=1))
col.create_index('A')  # add an index on key 'A'
col.create_index('name')  # add an index on key 'name'
print(timeit(update_with_update_many, number=1))
print(timeit(update_with_bulk, number=1))
answered Sep 22 '22 by squanto773