I need to ignore duplicate inserts when using insert_many with pymongo, where the duplicates are determined by a unique index. I've seen this question asked on Stack Overflow, but I haven't seen a useful answer.
Here's my code snippet:
    try:
        results = mongo_connection[db][collection].insert_many(documents, ordered=False, bypass_document_validation=True)
    except pymongo.errors.BulkWriteError as e:
        logger.error(e)
I would like insert_many to ignore duplicates and not throw an exception (which fills up my error logs). Alternatively, is there a separate exception handler I could use, so that I can just ignore the errors? I miss "w=0"...
Thanks
db.collection.insertMany() can be used inside multi-document transactions. In most cases, a multi-document transaction incurs a greater performance cost than single-document writes, and the availability of multi-document transactions should not be a replacement for effective schema design.
You can deal with this by inspecting the errors produced with BulkWriteError. This is actually an "object" which has several properties. The interesting parts are in details:
    import pymongo
    from bson.json_util import dumps
    from pymongo import MongoClient

    client = MongoClient()
    db = client.test
    collection = db.duptest

    # Two documents share the same _id, so one of them is a duplicate
    docs = [{'_id': 1}, {'_id': 1}, {'_id': 2}]

    try:
        result = collection.insert_many(docs, ordered=False)
    except pymongo.errors.BulkWriteError as e:
        # One entry per document that failed to insert
        print(e.details['writeErrors'])
On a first run, this will give the list of errors under e.details['writeErrors']:
[
{
'index': 1,
'code': 11000,
'errmsg': u'E11000 duplicate key error collection: test.duptest index: _id_ dup key: { : 1 }',
'op': {'_id': 1}
}
]
On a second run, you see three errors because all items existed:
[
{
"index": 0,
"code": 11000,
"errmsg": "E11000 duplicate key error collection: test.duptest index: _id_ dup key: { : 1 }",
"op": {"_id": 1}
},
{
"index": 1,
"code": 11000,
"errmsg": "E11000 duplicate key error collection: test.duptest index: _id_ dup key: { : 1 }",
"op": {"_id": 1}
},
{
"index": 2,
"code": 11000,
"errmsg": "E11000 duplicate key error collection: test.duptest index: _id_ dup key: { : 2 }",
"op": {"_id": 2}
}
]
So all you need to do is filter the array for entries with "code": 11000 and then only "panic" when something else is in there:
    # Inside the except block: keep only the errors that are not duplicate key errors
    panic = [x for x in e.details['writeErrors'] if x['code'] != 11000]
    if len(panic) > 0:
        print("really panic")
That gives you a mechanism for ignoring the duplicate key errors but of course paying attention to something that is actually a problem.
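As a rough sketch of the same idea in reusable form, the filtering can be pulled into a small helper that works on the details dictionary alone, so it can be tried without a running server. The name split_write_errors and the sample data are assumptions made for illustration; the second sample error (code 121, document validation failure) is invented to show the "panic" path and is not output from the example above.

```python
DUPLICATE_KEY_CODE = 11000  # server error code behind "E11000 duplicate key error"


def split_write_errors(details):
    """Split details['writeErrors'] into (duplicates, others)."""
    duplicates, others = [], []
    for err in details.get("writeErrors", []):
        if err.get("code") == DUPLICATE_KEY_CODE:
            duplicates.append(err)
        else:
            others.append(err)
    return duplicates, others


# Hand-written stand-in for e.details (hypothetical data, not real server output).
sample_details = {
    "writeErrors": [
        {"index": 1, "code": 11000, "errmsg": "E11000 duplicate key error", "op": {"_id": 1}},
        {"index": 2, "code": 121, "errmsg": "Document failed validation", "op": {"_id": 2}},
    ],
}

duplicates, others = split_write_errors(sample_details)
print(len(duplicates), "duplicate(s) ignored")
if others:
    # Only errors that are not duplicates deserve a log entry or a re-raise.
    print("really panic:", [err["code"] for err in others])
```

In the except block you would call split_write_errors(e.details) and log or re-raise only when the second list is non-empty.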
Adding more to Neil's solution.
Passing ordered=False lets the remaining documents be inserted even after a duplicate key error. bypass_document_validation=True only skips schema validation and has no effect on how duplicates are handled.
    from pymongo import MongoClient, errors

    DB_CLIENT = MongoClient()
    MY_DB = DB_CLIENT['my_db']
    TEST_COLL = MY_DB.dup_test_coll

    doc_list = [
        {
            "_id": "82aced0eeab2467c93d04a9f72bf91e1",
            "name": "shakeel"
        },
        {
            "_id": "82aced0eeab2467c93d04a9f72bf91e1",  # duplicate error: 11000
            "name": "shakeel"
        },
        {
            "_id": "fab9816677774ca6ab6d86fc7b40dc62",  # this new doc gets inserted
            "name": "abc"
        }
    ]

    try:
        # inserts new documents even on error
        TEST_COLL.insert_many(doc_list, ordered=False, bypass_document_validation=True)
    except errors.BulkWriteError as e:
        print(f"Articles bulk insertion error {e}")

        panic_list = list(filter(lambda x: x['code'] != 11000, e.details['writeErrors']))
        if len(panic_list) > 0:
            print(f"these are not duplicate errors {panic_list}")
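A possible follow-up, offered only as a sketch: each entry in writeErrors carries the index of the offending document in the list passed to insert_many, so the errors can be mapped back to the input. The helper name partition_by_write_errors is hypothetical, and the error list below is hand-written in the shape shown earlier rather than captured from a server. It assumes ordered=False, so that every document was attempted.

```python
def partition_by_write_errors(docs, write_errors):
    """Return (failed_docs, remaining_docs) using each error's 'index' field."""
    failed_positions = {err["index"] for err in write_errors}
    failed = [doc for i, doc in enumerate(docs) if i in failed_positions]
    remaining = [doc for i, doc in enumerate(docs) if i not in failed_positions]
    return failed, remaining


doc_list = [
    {"_id": "82aced0eeab2467c93d04a9f72bf91e1", "name": "shakeel"},
    {"_id": "82aced0eeab2467c93d04a9f72bf91e1", "name": "shakeel"},
    {"_id": "fab9816677774ca6ab6d86fc7b40dc62", "name": "abc"},
]

# Hand-written stand-in for e.details['writeErrors'] (not real server output).
write_errors = [{"index": 1, "code": 11000, "op": doc_list[1]}]

failed, remaining = partition_by_write_errors(doc_list, write_errors)
print("skipped:", [d["_id"] for d in failed])
print("not reported as failed:", [d["_id"] for d in remaining])
```

This only tells you which documents the server reported as failed; it does not account for write concern errors or a dropped connection.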
And since we are talking about duplicates, it's worth checking this solution as well.