
Is there a way to skip over existing _id's for insert_many in Pymongo 3.0?

I'm updating a DB that has several million documents, with fewer than 10 _id collisions.

I'm currently using the PyMongo module to do batch inserts using insert_many by:

  1. Querying the db to see if the _id exists
  2. Adding the document to an array if the _id doesn't exist
  3. Inserting into the database using insert_many, 1000 documents at a time

There are only about 10 collisions out of several million documents and I'm currently querying the database for each _id. I think that I could cut down on overall insert time by a day or two if I could cut out the query process.

Is there something similar to upsert perhaps that only inserts a document if it doesn't exist?

SLee asked Dec 20 '22 02:12


2 Answers

A better way to handle this, and to insert or update many documents efficiently, is to use the Bulk Operations API, which sends everything in batches and receives a single response in confirmation.

This can be handled in two ways.

First, to ignore any duplicate-key errors on the primary key or other unique indexes, you can use an unordered form of operation:

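import pymongo

# collection: an existing pymongo Collection; docs: an iterable of dicts
bulk = pymongo.bulk.BulkOperationBuilder(collection, ordered=False)
for doc in docs:
    bulk.insert(doc)

# Raises pymongo.errors.BulkWriteError if any insert failed (e.g. duplicate _id)
response = bulk.execute()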

The unordered (ordered=False) argument means that the operations can execute in any order and that the whole batch is attempted, with any errors collected and reported at the end rather than stopping the batch. In PyMongo, execute() reports them by raising BulkWriteError once the batch has been attempted, so catching that exception is one way to "ignore" the duplicates and move along.

The alternative approach is much the same, but uses the upsert functionality along with $setOnInsert:

import pymongo

bulk = pymongo.bulk.BulkOperationBuilder(collection, ordered=True)
for doc in docs:
    # Create doc only when no document with this _id exists yet
    bulk.find({"_id": doc["_id"]}).upsert().update_one({
        "$setOnInsert": doc
    })

response = bulk.execute()

Here the query portion in .find() looks for an existing document by its primary key or, alternatively, by the document's unique keys. Where no match is found, an upsert occurs and a new document is created. Since all the modification content is within $setOnInsert, the document's fields are only written when an upsert occurs. When a document is matched, nothing is changed with respect to the data kept under this operator.

"Ordered" in this case means that every statement is committed in the same order it was created in. Any error will also halt the batch at the point where it occurred, so that no more operations are committed. It's optional, but probably advisable for normal duplicate handling, where later statements duplicate the data of a previous one.
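As a rough sketch of the same upsert idea, here it is written with `Collection.bulk_write` and `UpdateOne`, which PyMongo also provides from 3.0 onward, swapped in for the builder object above. The helper names `upsert_spec` and `insert_missing` are hypothetical. The sketch assumes every document has an `_id` plus at least one other field, and it leaves `_id` out of `$setOnInsert` as a cautious choice, since an upsert takes the `_id` from the query anyway.

```python
def upsert_spec(doc):
    """Return (filter, update) for an insert-if-absent upsert of `doc`."""
    # Match on the primary key; everything else goes under $setOnInsert, so an
    # existing document with the same _id is left untouched.
    fields = {key: value for key, value in doc.items() if key != "_id"}
    return {"_id": doc["_id"]}, {"$setOnInsert": fields}


def insert_missing(collection, docs):
    """Upsert `docs` in one bulk call; return how many were newly created.

    Assumes `docs` is non-empty: bulk_write rejects an empty request list.
    """
    from pymongo import UpdateOne  # imported here so upsert_spec needs no driver

    requests = []
    for doc in docs:
        query, update = upsert_spec(doc)
        requests.append(UpdateOne(query, update, upsert=True))
    # ordered=True mirrors the builder example: stop at the first real error.
    result = collection.bulk_write(requests, ordered=True)
    return result.upserted_count
```

Calling `insert_missing(collection, docs)` on a batch that contains a few existing `_id` values should then report only the newly created documents, without raising on the ones that already exist.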

So for more efficient writes, the general idea is to use the "Bulk" API and build your actions accordingly. The choice here really comes down to whether the "order of insertion" from the source is important to you or not.

Of course, the same ordered=False option applies to insert_many, which itself uses "Bulk" operations in the newer driver releases. But you will get more flexibility from sticking with the general interface, which can mix operations with a simple API.
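For the 1000-documents-at-a-time pattern from the question, a minimal sketch with `insert_many(..., ordered=False)` might look like the following. The helpers `chunked` and `insert_in_batches` are hypothetical names, and the sketch deliberately treats every `BulkWriteError` as "some documents were skipped", which is only reasonable if duplicates are the sole failure you expect.

```python
from itertools import islice


def chunked(iterable, size=1000):
    """Yield successive lists of at most `size` items from `iterable`."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def insert_in_batches(collection, docs, size=1000):
    """Insert `docs` batch by batch; return the number actually inserted."""
    from pymongo.errors import BulkWriteError  # lazy import: chunked() needs no driver

    inserted = 0
    for batch in chunked(docs, size):
        try:
            result = collection.insert_many(batch, ordered=False)
            inserted += len(result.inserted_ids)
        except BulkWriteError as exc:
            # The rest of the batch was still attempted; count what succeeded.
            inserted += exc.details["nInserted"]
    return inserted
```

Because no `_id` lookup happens before each insert, the per-document query from the question disappears and the handful of collisions are simply skipped by the server.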

Blakes Seven answered May 06 '23 09:05


While Blakes' answer is great, in most cases it's fine to pass ordered=False to insert_many and catch BulkWriteError in case of duplicates.

import logging
from pymongo.errors import BulkWriteError

logger = logging.getLogger(__name__)
try:
    collection.insert_many(data, ordered=False)
except BulkWriteError:
    logger.info('Duplicates were found.')
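One caveat is that BulkWriteError covers every kind of write failure, not just duplicates. A possible refinement, sketched here with hypothetical helper names, is to inspect the `writeErrors` list in the exception's `details` and re-raise unless every error carries MongoDB's duplicate-key code 11000. Write concern errors are not examined in this sketch.

```python
DUPLICATE_KEY = 11000  # MongoDB's error code for a duplicate key


def non_duplicate_errors(details):
    """Return the write errors in `details` that are not duplicate-key errors."""
    return [
        error
        for error in details.get("writeErrors", [])
        if error.get("code") != DUPLICATE_KEY
    ]


def insert_ignoring_duplicates(collection, data):
    """Insert `data`, silently skipping documents whose _id already exists."""
    from pymongo.errors import BulkWriteError  # lazy import: the filter needs no driver

    try:
        collection.insert_many(data, ordered=False)
    except BulkWriteError as exc:
        # Anything other than a duplicate key is a real problem: surface it.
        if non_duplicate_errors(exc.details):
            raise
```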
stasdeep answered May 06 '23 08:05