How to Ignore Duplicate Key Errors Safely Using insert_many

I need to ignore duplicate inserts when using insert_many with pymongo, where the duplicates are based on the index. I've seen this question asked on Stack Overflow, but I haven't seen a useful answer.

Here's my code snippet:

try:
    results = mongo_connection[db][collection].insert_many(documents, ordered=False, bypass_document_validation=True)
except pymongo.errors.BulkWriteError as e:
    logger.error(e)

I would like insert_many to ignore duplicates and not throw an exception (which fills up my error logs). Alternatively, is there a separate exception handler I could use, so that I can just ignore the errors? I miss `w=0`...

Thanks

vgoklani asked Jun 30 '17


2 Answers

You can deal with this by inspecting the errors carried by the BulkWriteError exception. This is an object with several properties; the interesting parts are in its `details` attribute:

import pymongo
from pymongo import MongoClient

client = MongoClient()
db = client.test
collection = db.duptest

# The second document deliberately repeats _id 1 to trigger a duplicate key error
docs = [{'_id': 1}, {'_id': 1}, {'_id': 2}]

try:
    result = collection.insert_many(docs, ordered=False)
except pymongo.errors.BulkWriteError as e:
    print(e.details['writeErrors'])

On a first run, this will give the list of errors under e.details['writeErrors']:

[
  { 
    'index': 1,
    'code': 11000, 
    'errmsg': 'E11000 duplicate key error collection: test.duptest index: _id_ dup key: { : 1 }', 
    'op': {'_id': 1}
  }
]

On a second run, you see three errors because all items existed:

[
  {
    "index": 0,
    "code": 11000,
    "errmsg": "E11000 duplicate key error collection: test.duptest index: _id_ dup key: { : 1 }", 
    "op": {"_id": 1}
   }, 
   {
     "index": 1,
     "code": 11000,
     "errmsg": "E11000 duplicate key error collection: test.duptest index: _id_ dup key: { : 1 }",
     "op": {"_id": 1}
   },
   {
     "index": 2,
     "code": 11000,
     "errmsg": "E11000 duplicate key error collection: test.duptest index: _id_ dup key: { : 2 }",
     "op": {"_id": 2}
   }
]

So all you need to do is filter the array for entries with "code": 11000, and only "panic" when something else is in there:

panic = [x for x in e.details['writeErrors'] if x['code'] != 11000]

if len(panic) > 0:
    print("really panic")

That gives you a mechanism for ignoring duplicate key errors while still paying attention to anything that is actually a problem.
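The filtering step can be packaged as a small pure helper that splits `details['writeErrors']` into duplicate-key errors and genuine failures. This is a sketch; the helper name and the sample `details` dict are illustrative, shaped like the output shown above:

```python
DUPLICATE_KEY = 11000  # MongoDB error code for duplicate key violations

def split_write_errors(details):
    """Separate duplicate-key errors from errors that need attention."""
    errors = details.get('writeErrors', [])
    dups = [err for err in errors if err['code'] == DUPLICATE_KEY]
    real = [err for err in errors if err['code'] != DUPLICATE_KEY]
    return dups, real

# A details dict mimicking BulkWriteError.details from the run above,
# plus one hypothetical non-duplicate error (code 121: validation failure)
sample = {
    'writeErrors': [
        {'index': 1, 'code': 11000,
         'errmsg': 'E11000 duplicate key error', 'op': {'_id': 1}},
        {'index': 2, 'code': 121,
         'errmsg': 'Document failed validation', 'op': {'_id': 3}},
    ]
}

dups, real = split_write_errors(sample)
print(len(dups), len(real))  # 1 1
```

In the except block you would call `split_write_errors(e.details)` and only log or re-raise when `real` is non-empty.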

Neil Lunn answered Oct 10 '22

Adding more to Neil's solution.

Passing `ordered=False` lets the remaining inserts proceed even after a duplicate key error (`bypass_document_validation=True` only skips schema validation; it does not affect duplicate handling).

from pymongo import MongoClient, errors

DB_CLIENT = MongoClient()
MY_DB = DB_CLIENT['my_db']
TEST_COLL = MY_DB.dup_test_coll

doc_list = [
    {
        "_id": "82aced0eeab2467c93d04a9f72bf91e1",
        "name": "shakeel"
    },
    {
        "_id": "82aced0eeab2467c93d04a9f72bf91e1",  # duplicate error: 11000
        "name": "shakeel"
    },
    {
        "_id": "fab9816677774ca6ab6d86fc7b40dc62",  # this new doc gets inserted
        "name": "abc"
    }
]

try:
    # inserts new documents even on error
    TEST_COLL.insert_many(doc_list, ordered=False, bypass_document_validation=True)
except errors.BulkWriteError as e:
    print(f"Articles bulk insertion error {e}")

    panic_list = list(filter(lambda x: x['code'] != 11000, e.details['writeErrors']))
    if len(panic_list) > 0:
        print(f"these are not duplicate errors {panic_list}")

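The `details` dict also reports how many documents were actually written (`nInserted`), which is useful for logging how many duplicates were skipped. Below is a sketch under the same assumptions; the helper name is illustrative and the sample dict mirrors what the code above would produce for `doc_list`:

```python
DUPLICATE_KEY = 11000  # MongoDB error code for duplicate key violations

def summarize_bulk_error(details):
    """Return (inserted, skipped_dups, other_errors) from a BulkWriteError details dict."""
    errors = details.get('writeErrors', [])
    dup_count = sum(1 for err in errors if err['code'] == DUPLICATE_KEY)
    others = [err for err in errors if err['code'] != DUPLICATE_KEY]
    return details.get('nInserted', 0), dup_count, others

# Shaped like e.details for doc_list above: 2 unique docs inserted, 1 duplicate rejected
sample = {
    'nInserted': 2,
    'writeErrors': [
        {'index': 1, 'code': 11000,
         'errmsg': 'E11000 duplicate key error',
         'op': {'_id': '82aced0eeab2467c93d04a9f72bf91e1'}},
    ]
}

inserted, dups, others = summarize_bulk_error(sample)
print(inserted, dups, len(others))  # 2 1 0
```

This keeps the log message informative ("2 inserted, 1 duplicate skipped") instead of dumping the whole exception.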
And since we are talking about duplicates, it's worth checking this solution as well.

Shakeel answered Oct 10 '22