How do I bulk update/insert in MongoDB with pymongo/pandas?
The error I get is "batch op errors occurred". The reason I get it is that I set the "_id", which I want to do. The code runs fine on the first run, but fails on the second. I want to use pandas in the workflow. The data does have a datetime object.
The syntax is completely different for upsert=True with update. An efficient solution with update would be helpful, where "_id" or "qid" could be set. But there are Python datetime objects!
import json
import pandas as pd

# Pull the source rows and build a deterministic "_id" per document
InSQL = 'SELECT * from database2.table2 '
sqlOut = pd.read_sql(InSQL, cxn)
sqlOut['_id'] = "20170101" + ":" + sqlOut['Var']
# Serialize datetimes as ISO strings so the records are JSON-safe
dfOut = sqlOut.to_json(orient='records', date_format='iso')
try:
    db["test"].insert_many(json.loads(dfOut))
except Exception as e:
    print(e)
I have given a 50pt bounty, which expired with no answer. Hmm...
You get an error because you try to insert documents whose fields conflict with those of existing documents on the second and subsequent insert_many calls. You correctly inferred it is due to setting _id explicitly, which then conflicts with existing _id values in the collection.
MongoDB automatically creates a unique index on _id, which forbids duplicate values.
You need to update or replace your documents on calls after the first one (which inserted the documents in their first version). There is indeed a concept of "upsert" which will take care of inserting non-previously-existing documents in the collection as well as updating the existing ones.
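For illustration, here is a minimal sketch of that upsert behavior with update_one (the "20170101:A" _id just mirrors the format built in the question, and db is assumed to be an existing pymongo database handle): the first call inserts the document, the second updates it in place instead of raising a duplicate-key error.

from datetime import datetime

# First call: no document with this _id exists yet, so it is inserted
db["test"].update_one(
    {"_id": "20170101:A"},
    {"$set": {"Var": "A", "ts": datetime(2017, 1, 1)}},
    upsert=True
)
# Second call: the document now exists, so it is updated in place
db["test"].update_one(
    {"_id": "20170101:A"},
    {"$set": {"Var": "A", "ts": datetime(2017, 1, 2)}},
    upsert=True
)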
Your options:
Most efficient: pymongo.collection.Collection.bulk_write
import pymongo

# One ReplaceOne upsert per document, sent to the server in a single batch
operations = [pymongo.operations.ReplaceOne(
    filter={"_id": doc["_id"]},
    replacement=doc,
    upsert=True
) for doc in json.loads(dfOut)]
result = db["test"].bulk_write(operations)
# handle results
Note that its efficiency also depends on whether the field used in the filter is indexed in the collection, which is incidentally the case for _id.
(also see pymongo.operations.ReplaceOne)
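As a sketch of what the "# handle results" step could look like: bulk_write returns a BulkWriteResult whose counters let you tell fresh inserts apart from updates.

result = db["test"].bulk_write(operations)
print("upserted:", result.upserted_count)  # inserted because no matching _id existed
print("matched:", result.matched_count)    # documents whose _id already existed
print("modified:", result.modified_count)  # matched documents that actually changed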
Loop over your documents and call pymongo.collection.Collection.update_one or pymongo.collection.Collection.replace_one (inefficient, because not bulk):
import pymongo

results = []
for doc in json.loads(dfOut):  # json.loads, not json.load: dfOut is a string
    result = db["test"].replace_one(
        filter={"_id": doc["_id"]},
        replacement=doc,
        upsert=True
    )
    results.append(result)
# handle results
Note: pymongo.collection.Collection.update_many
seems unsuitable for your needs since you are not trying to set the same value on all matches of a given filter.
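One more note on the datetime objects mentioned in the question: the to_json/json.loads round trip leaves them as ISO strings in MongoDB, not as BSON dates. If you want real date types, a sketch of an alternative is to skip the JSON step entirely, since pandas Timestamp values subclass datetime and pymongo encodes them natively (this assumes the DataFrame holds no NaT values, which bson may not encode cleanly):

# Build the documents straight from the DataFrame, keeping native datetimes
docs = sqlOut.to_dict(orient='records')
operations = [pymongo.operations.ReplaceOne(
    filter={"_id": doc["_id"]},
    replacement=doc,
    upsert=True
) for doc in docs]
db["test"].bulk_write(operations)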
The batch op error may be caused by duplicate _id values, so delete the documents with the same _id already in Mongo before inserting.
Or use update_many: https://api.mongodb.com/python/current/api/pymongo/collection.html?highlight=update#pymongo.collection.Collection.update_many
https://docs.mongodb.com/manual/reference/method/db.collection.updateMany/
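A minimal sketch of that delete-then-insert approach, reusing dfOut from the question:

docs = json.loads(dfOut)
# Drop any previously inserted documents sharing an _id, then insert fresh copies
db["test"].delete_many({"_id": {"$in": [doc["_id"] for doc in docs]}})
db["test"].insert_many(docs)

Note that this is two round trips and not atomic, so a single bulk upsert as in the other answer is usually preferable.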