
Best way to read and update MongoDB documents using PyMongo

I am trying to read a MongoDB collection document by document in order to fetch every record, encrypt some of its fields, and put it back into the database.

for record in coll.find():
    # modify record here
    # update() needs a filter that selects the document to overwrite
    coll.update({'_id': record['_id']}, record)

This causes a serious problem: documents that have already been updated are read again by the cursor, so the same document is processed (and updated) again in the loop.

I think this may be one solution to the problem:

# read the whole collection into memory before updating anything
list_coll = list(coll.find())
for rec in list_coll:
    # modify record here
    coll.update({'_id': rec['_id']}, rec)

But is this the best way to do it? What happens if the collection is large? Could a large list_coll exhaust RAM? Kindly suggest a better way of doing it.

thanks

wudpecker asked Aug 25 '14 11:08



2 Answers

You want the "Bulk Operations API" from MongoDB. It was mostly introduced with MongoDB 2.6, which is a compelling reason to upgrade if you have not already.

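bulk = coll.initialize_ordered_bulk_op()
counter = 0

for record in coll.find(snapshot=True):
    # calculate the new value first (newValue is a placeholder for it),
    # then queue the update instead of sending it straight away
    bulk.find({ '_id': record['_id'] }).update({ '$set': { 'field': newValue } })
    counter += 1

    # send the queued updates to the server once per 1000 documents
    if counter % 1000 == 0:
        bulk.execute()
        bulk = coll.initialize_ordered_bulk_op()

# flush whatever is left over from the last partial batch
if counter % 1000 != 0:
    bulk.execute()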

This is much better because you are not sending every request to the server individually, just one batch per 1000 operations. The "Bulk API" actually sorts this out for you somewhat, but you really want to manage the batching yourself so that you do not consume too much memory in your app.

Way of the future. Use it.
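For readers on a newer driver: as far as I know, `initialize_ordered_bulk_op` was deprecated in PyMongo 3.5 and removed in PyMongo 4, and the same batching idea is expressed there with `Collection.bulk_write` and `UpdateOne`. The following is only a sketch under that assumption. The names `chunked`, `build_set_updates`, `encrypt_collection`, and the `encrypt_value` callback are hypothetical helpers, not part of PyMongo, and the field names are whatever you pass in.

```python
from itertools import islice


def chunked(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def build_set_updates(doc, fields, encrypt_value):
    """Return {field: new_value} for those `fields` that are present in `doc`."""
    return {name: encrypt_value(doc[name]) for name in fields if name in doc}


def encrypt_collection(coll, fields, encrypt_value, batch_size=1000):
    """Rewrite `fields` of every document in `coll`, one bulk request per batch."""
    # Imported here so the two helpers above work without PyMongo installed.
    from pymongo import UpdateOne

    # Ask only for _id plus the fields to rewrite. The cursor streams results,
    # so at most `batch_size` documents are held in memory at a time.
    cursor = coll.find({}, projection=list(fields))
    for batch in chunked(cursor, batch_size):
        ops = []
        for doc in batch:
            changes = build_set_updates(doc, fields, encrypt_value)
            if changes:  # skip documents that have none of the fields
                ops.append(UpdateOne({'_id': doc['_id']}, {'$set': changes}))
        if ops:  # bulk_write raises if given an empty list of operations
            coll.bulk_write(ops, ordered=False)
```

A call might look like `encrypt_collection(coll, ['ssn', 'card'], my_encrypt)`, where `my_encrypt` is your own function. `ordered=False` is used here because the updates do not depend on one another; pass `ordered=True` if you would rather stop at the first failure.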

Neil Lunn answered Oct 11 '22 12:10



If your collection isn't sharded, you can prevent your find cursor from seeing the same document again after it has been updated by using the snapshot parameter:

for record in coll.find(snapshot=True):
    # modify record here
    # update() needs a filter that selects the document to overwrite
    coll.update({'_id': record['_id']}, record)

If your collection is sharded, keep a set (a hash-based collection) of the _id values that you've already updated, and check that set before you modify each record to ensure you don't update the same one twice.
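One way to sketch that bookkeeping is a small generator that wraps the cursor and drops any document whose _id has already been seen. The name `unseen` is a hypothetical helper, and the plain dicts below only stand in for documents coming from `coll.find()`.

```python
def unseen(records, seen_ids=None):
    """Yield each record whose _id has not been yielded before."""
    seen_ids = set() if seen_ids is None else seen_ids
    for record in records:
        record_id = record['_id']
        if record_id in seen_ids:
            continue  # the cursor returned this document a second time
        seen_ids.add(record_id)
        yield record


# Plain dicts standing in for a cursor that returns two documents twice.
docs = [{'_id': 1}, {'_id': 2}, {'_id': 1}, {'_id': 3}, {'_id': 2}]
processed = [doc['_id'] for doc in unseen(docs)]
```

In the real loop this would read `for record in unseen(coll.find()):`. The set holds one _id per document, so it uses far less memory than keeping whole documents in a list, but it still grows with the size of the collection.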

JohnnyHK answered Oct 11 '22 12:10
