I'm updating a DB that has several million documents with fewer than 10 _id collisions. I'm currently using the PyMongo module to do batch inserts with insert_many, querying the database for each _id first to check whether it already exists. Since there are only about 10 collisions out of several million documents, I think I could cut a day or two off the overall insert time if I could skip that per-document query.
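Roughly, the current flow looks like this (a minimal sketch of what I described above; the collection and variable names are placeholders):

# Check-then-insert pattern: one query round trip per document.
to_insert = []
for doc in docs:
    if collection.find_one({"_id": doc["_id"]}) is None:
        to_insert.append(doc)
if to_insert:
    collection.insert_many(to_insert)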
Is there something similar to upsert that only inserts a document if it doesn't already exist?
The better way to handle this, and to "insert/update" many documents efficiently in general, is to use the Bulk Operations API to submit everything in "batches", with efficient sending of all operations and a "singular response" in confirmation.
This can be handled in two ways.
Firstly, to ignore any "duplicate errors" on the primary key or other unique indexes, you can use an "UnOrdered" form of operation:
from pymongo.bulk import BulkOperationBuilder

# Unordered bulk: queue everything, then send it in one batch.
bulk = BulkOperationBuilder(collection, ordered=False)
for doc in docs:
    bulk.insert(doc)
response = bulk.execute()
The "UnOrdered" or false
argument there means that the operations can both execute in any order and that the "whole" batch will be completed with any actual errors simply being "reported" in the response. So that is one way to basically "ignore" the duplicates and move along.
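Note that newer driver releases have dropped the legacy builder shown above, so a rough equivalent uses bulk_write with InsertOne requests. This is a sketch; be aware that in PyMongo an unordered batch with failures still raises BulkWriteError, but only after every operation has been attempted:

from pymongo import InsertOne
from pymongo.errors import BulkWriteError

requests = [InsertOne(doc) for doc in docs]
try:
    # ordered=False: all inserts are attempted even if some hit duplicates.
    result = collection.bulk_write(requests, ordered=False)
except BulkWriteError as exc:
    # Per-operation failures (duplicates included) are reported here.
    result = exc.details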
The alternate approach is much the same, but uses the "upsert" functionality along with $setOnInsert:
from pymongo.bulk import BulkOperationBuilder

# Ordered bulk: operations are committed in the order they were queued.
bulk = BulkOperationBuilder(collection, ordered=True)
for doc in docs:
    # Match on _id; write the fields only when the upsert creates the doc.
    bulk.find({"_id": doc["_id"]}).upsert().update_one({
        "$setOnInsert": doc
    })
response = bulk.execute()
Whereby the "query" portion in .find()
is used to query for the presence of a document using the "primary key" or alternately the "unique keys" of the document. Where no match is found an "upsert" occurs with a new doccument created. Since all the modification content is within $setOnInsert
then the document fields are only modified here when an "upsert" occurs. Otherwise while the document is "matched" nothing is actually changed with respect to the data kept under this operator.
The "Ordered" in this case means that every statement is actually committed in the "same" order it was created in. Also any "errors" here will halt the update ( at the point where the error occurred ) so that no more operations will be committed. It's optional, but probably advised for normal "dupliate" behaviour where later statements "duplicate" the data of a previous one.
So for more efficient writes, the general idea is to use the "Bulk" API and build your actions accordingly. The choice here really comes down to whether the "order of insertion" from the source is important to you or not.
Of course the same ordered=False operation applies to insert_many, which actually uses "Bulk" operations in the newer driver releases. But you will get more flexibility from sticking with the general interface, which can "mix" operations with a simple API, as sketched below.
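For instance, a single batch can combine inserts, updates, and deletes (a sketch; the documents and filters here are purely illustrative):

from pymongo import DeleteOne, InsertOne, UpdateOne

# One bulk_write call can mix different write types in a single batch.
result = collection.bulk_write([
    InsertOne({"_id": 1, "status": "new"}),
    UpdateOne({"_id": 2}, {"$set": {"status": "seen"}}, upsert=True),
    DeleteOne({"_id": 3}),
], ordered=False)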
While Blakes' answer is great, for most cases it's fine to use the ordered=False argument and catch BulkWriteError in case of duplicates.
import logging
from pymongo.errors import BulkWriteError

logger = logging.getLogger(__name__)
try:
    collection.insert_many(data, ordered=False)
except BulkWriteError:
    logger.info('Duplicates were found.')
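If you want to be sure only duplicates were swallowed, you can inspect the exception's details; a sketch (11000 is MongoDB's duplicate-key error code):

try:
    collection.insert_many(data, ordered=False)
except BulkWriteError as exc:
    # Each failed write is listed with its error code.
    errors = exc.details["writeErrors"]
    dupes = [e for e in errors if e["code"] == 11000]
    logger.info('Skipped %d duplicates.', len(dupes))
    if len(dupes) != len(errors):
        raise  # something other than a duplicate failed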