I have a MongoDB document which I want to add to a collection only if it does not already exist, without changing an existing one.
In other words, I'm looking for an atomic way to:
1. find whether a document exists (based on given key criteria)
2. if it exists: return it
3. otherwise: add the new one
This is like the upsert option, except it favours the existing document over the new one.
P.S. If possible, I'd prefer not to use unique indexes.
Thanks all in advance.
I recently encountered this issue and made use of the upsert flag as some have hinted at. I went through a number of approaches before settling on my recommended solution, which is the last option described in this answer. Please forgive my use of PyMongo code; hopefully it won't be difficult to translate to your project.
First, MongoDB's documentation explicitly warns against using upsert without a unique index. It would seem the command itself is implemented using the standard "find then insert" approach and is NOT atomic: two concurrent clients could both fail their finds and then each insert their own copy of the document. Without a unique index to enforce uniqueness, MongoDB will allow that to happen! Keep this in mind as you implement your solution.
from pymongo import ReturnDocument

objID = db.collection.find_one_and_update(
    myDoc,
    {"$unset": {"<<<IHopeThisIsNeverInTheDB>>>": ""}},  # There is no NOOP...
    {"_id": 1},  # We only want the "_id".
    return_document=ReturnDocument.AFTER,  # Without this, an upsert returns None.
    upsert=True,
)["_id"]
Using the faux NOOP, I managed to convert the update call into a find call with an upsert feature, successfully implementing an "insert if new" in a single MongoDB call. This roughly translates to the following MongoDB client operation:
db.collection.findAndModify({
    query: <your doc>,
    update: {$unset: {"<<<IHopeThisIsNeverInTheDB>>>": ""}}, // There is no NOOP...
    new: true, // Without this, an upsert returns null.
    fields: {_id: 1}, // Only want the ObjectId.
    upsert: true, // Create if no matches.
})
A problem (or feature) with this code is that it will match documents that contain a superset of the data in <your doc>, not only exact matches. For example, consider a collection containing:
{"foo": "bar", "apples": "oranges"}
The above code will match the one document already in the collection to any of the following documents being uploaded:
{"foo": "bar"}
{"apples": "oranges"}
{"foo": "bar", "apples": "oranges"}
Therefore, it is not a true "insert if new" because it fails to ignore superset documents, but for some applications this may be good enough and will be very fast compared to the brute force approach.
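This superset matching follows from MongoDB's top-level equality semantics, which can be illustrated in pure Python. This is a sketch of the matching rule only, not MongoDB code; the name naive_match is mine:

```python
def naive_match(query, doc):
    """Mimic MongoDB's top-level equality matching: every query field must
    equal the corresponding document field; extra document fields are ignored."""
    return all(doc.get(k) == v for k, v in query.items())

stored = {"foo": "bar", "apples": "oranges"}  # the document already in the collection

print(naive_match({"foo": "bar"}, stored))                       # True: subset filter still matches
print(naive_match({"apples": "oranges"}, stored))                # True
print(naive_match({"foo": "bar", "apples": "oranges"}, stored))  # True: exact match
```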
If it is good enough to only match subdocuments:
q = {k: {"$eq": v} for k, v in myDoc.items()}  # Insert "$eq" operators on the root's subdocuments to require exact matches.
objID = db.collection.find_one_and_update(
    q,
    {"$unset": {"<<<IHopeThisIsNeverInTheDB>>>": ""}},  # There is no NOOP...
    {"_id": 1},  # We only want the "_id".
    return_document=ReturnDocument.AFTER,  # Without this, an upsert returns None.
    upsert=True,
)["_id"]
Note that $eq is order-dependent when comparing subdocuments, so if you're dealing with data structures that don't preserve field order (e.g. Python dict objects), this approach will not work.
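The order-dependence is easy to miss because Python itself compares dicts by content, not key order; a quick sketch of the mismatch:

```python
a = {"x": 1, "y": 2}
b = {"y": 2, "x": 1}

print(a == b)                              # True: Python dict equality ignores key order
print(list(a.items()) == list(b.items()))  # False: serialization order differs,
# so a BSON subdocument built from `a` would not $eq-match one built from `b`
```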
There are 4 approaches I can think of for this, with the last one being my recommended approach.
You can expand the previous approach with root checking: add client-side logic to check the root document and insert only if there was no complete match:
q = {k: {"$eq": v} for k, v in myDoc.items()}  # Insert "$eq" operators on the root's subdocuments to require exact matches.
resp = collection.update_many(
    q,
    {"$unset": {"<<<IHopeThisIsNeverInTheDB>>>": ""}},  # There is no NOOP...
    upsert=True,
)
objID = resp.upserted_id
if objID is None:
    # No upsert occurred. If you must, use a find to get the direct match:
    docs = collection.find(q, {k: 0 for k in myDoc.keys()}, limit=resp.matched_count)
    for doc in docs:
        if len(doc) == 1:  # Only match documents that have the "_id" field and nothing else.
            objID = doc["_id"]
            break
    else:  # No direct matches were found.
        objID = collection.insert_one(myDoc).inserted_id
Note the use of filtering known fields from the results of find to cut down data usage and simplify our equivalence checking. I also pass resp.matched_count as the query limit so we don't scan more documents than actually matched.
Note that this approach is optimized for upsert (two insert calls in a single insert function... yuk!) where you're creating documents more often than you're finding existing ones. In most "insert if new" situations I've encountered, the more common event is that the document already exists, in which case you want a "find first, insert if missing" approach. This leads to the other options.
Do the $eq-style query to match the subdocuments, then use client-side code to check the root and insert if there were no matches:
q = {k: {"$eq": v} for k, v in myDoc.items()}  # Insert "$eq" operators on the root's subdocuments to require exact matches.
docs = collection.find(q, {k: 0 for k in myDoc.keys()})  # Filter known fields so we isolate the mismatches.
for doc in docs:
    if len(doc) == 1:  # Only match documents that have the "_id" field and nothing else.
        objID = doc["_id"]
        break
else:  # No direct matches were found.
    objID = collection.insert_one(myDoc).inserted_id
Again, $eq is order-dependent, which could cause problems depending on your situation.
If you want to go order-independent, you can construct your query by simply flattening the document into dot notation. This bloats your query with duplicated parent keys, but that could be okay depending on your use case.
myDoc = {"llama": {"duck": "cake", "ate": "rake"}}
q = {"llama.duck": "cake", "llama.ate": "rake"}
docs = collection.find(q, {k: 0 for k in q.keys()})  # Filter known fields so we isolate the mismatches.
for doc in docs:
    if len(doc) == 1:  # Only match documents that have the "_id" field and nothing else.
        objID = doc["_id"]
        break
else:  # No direct matches were found.
    objID = collection.insert_one(myDoc).inserted_id
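The flattening above is hardcoded for the example; it can be generalized with a small recursive helper (the name flatten_doc is mine, not from any library, and this sketch does not handle arrays or empty subdocuments):

```python
def flatten_doc(doc, prefix=""):
    """Recursively flatten nested dicts into MongoDB dot-notation keys."""
    flat = {}
    for k, v in doc.items():
        key = f"{prefix}.{k}" if prefix else k
        if isinstance(v, dict):
            flat.update(flatten_doc(v, key))
        else:
            flat[key] = v
    return flat

q = flatten_doc({"llama": {"duck": "cake", "ate": "rake"}})
print(q)  # {'llama.duck': 'cake', 'llama.ate': 'rake'}
```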
There's likely a way to do this all server-side using JavaScript. Unfortunately, my JavaScript-fu is lacking at the moment.
Make the unique-index requirement work for you by creating that index on a hash of the document's contents, as suggested in this answer to another SO question: https://stackoverflow.com/a/27993841/2201287 . Ideally, this hash can be generated from the data alone, allowing you to create it without ever talking to MongoDB. The author of the linked answer uses a SHA-256 hash of the string representation of the JSON document. For this project I was already using xxHash, and thus opted for an xxHash of the bson.json_util.dumps(myDoc) output, where myDoc is the dict, collections.OrderedDict, or bson.son.SON object that I want to upload. Since I'm in Python with duck-typing and all that jazz, using json_util gives me the post-conversion state of the SON document and thus ensures that the hash generation is platform-agnostic, in case I want to generate these hashes in another program/language.
Note that hashes are generally order-dependent, so unordered structures like Python's dict will produce different hashes for identical data. In the event the user hands me a dict, I wrote a simple utility function that recursively converts dict objects to bson.son.SON objects with keys sorted via Python's sorted function.
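That recursive sorted conversion might look like the following sketch; it uses collections.OrderedDict as a stand-in for bson.son.SON so it runs without PyMongo installed (the name to_sorted is mine):

```python
from collections import OrderedDict

def to_sorted(obj):
    """Recursively rebuild dicts with keys in sorted order so that
    serialization (and therefore hashing) is order-independent."""
    if isinstance(obj, dict):
        return OrderedDict((k, to_sorted(obj[k])) for k in sorted(obj))
    if isinstance(obj, list):
        return [to_sorted(v) for v in obj]
    return obj

doc = {"b": {"z": 1, "a": 2}, "a": 3}
print(list(to_sorted(doc).keys()))  # ['a', 'b']
```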
Once you have a hash or other unique value that represents your data, and have created a unique index in MongoDB for that key, you can use the simple upsert approach to accomplish your "insert if new" function.
from pymongo import ReturnDocument

myDoc["xxHash"] = xxHashValue  # 32-bit signed integer generated from an xxHash of bson.json_util.dumps(myDoc).
objID = db.collection.find_one_and_update(
    myDoc,
    {"$unset": {"<<<IHopeThisIsNeverInTheDB>>>": ""}},  # There is no NOOP...
    {"_id": 1},  # We only want the "_id".
    return_document=ReturnDocument.AFTER,  # Without this, an upsert returns None.
    upsert=True,
)["_id"]
All the DB work happens in one short command and is blazingly fast with indexing. The hard part is just generating the hash.
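The hash generation itself can be sketched with stdlib stand-ins, using hashlib.sha256 and json.dumps in place of the answer's xxHash and bson.json_util.dumps (an assumption for illustration; plain json.dumps cannot serialize BSON-specific types like ObjectId, which is why the answer uses json_util):

```python
import hashlib
import json

def doc_hash(doc):
    """Hash a canonical JSON rendering (sorted keys, tight separators)
    so that equal documents always produce the same digest."""
    canonical = json.dumps(doc, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

print(doc_hash({"a": 1, "b": 2}) == doc_hash({"b": 2, "a": 1}))  # True: key order no longer matters
```

With a unique index on the stored hash field, concurrent inserts of the same document collide on the index instead of creating duplicates.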
So there you have a number of approaches that may fit your particular situation. Of course, if MongoDB had just supported root-level equivalence testing this would be a lot easier, but the hash approach is a great alternative and likely delivers the best speed overall.
Look at MongoDB's findAndModify method. It may fit almost all your criteria, including an upsert option.