Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

is there an option to find-or-insert in mongodb

Tags:

mongodb

I have a mongodb document which I want to add to a collection only if not existing but do not change an existing one.

in other words, I'm looking for an atomic way to:

1. find if a document exists (based on a given key criteria)
2. if it exists: 
2.1   return it
   otherwise:
2.1   add a new one

this is like the upsert option but instead if favours the existing document over the new one

P.S. if possible, I prefer not to use unique indexes

thanks all in advance

like image 649
akiva Avatar asked Mar 17 '14 12:03

akiva


People also ask

What is insert command in MongoDB?

In MongoDB, the insert() method inserts a document or documents into the collection. It takes two parameters, the first parameter is the document or array of the document that we want to insert and the remaining are optional. Using this method you can also create a collection by inserting documents.

What does find () do in MongoDB?

Find() Method. In MongoDB, find() method is used to select documents in a collection and return a cursor to the selected documents.

Can we use and in find MongoDB?

This operator is used to perform logical AND operation on the array of one or more expressions and select or retrieve only those documents that match all the given expression in the array. You can use this operator in methods like find(), update(), etc. according to your requirements.


2 Answers

I recently encountered this issue and made use of the upsert flag as some have hinted at. I went through a number of approaches before settling on my recommended solution which is the last option described in this answer. Please forgive my use of PyMongo code. Hopefully it won't be difficult to translate to your project.

First, MongoDB's documentation explicitly warns against using upsert without a unique index. It would seem the command itself is implemented using the standard "find/insert" approach and is NOT atomic. 2 concurrent clients could fail their finds and then each insert their own copy of the document. Without a unique index to enforce no duplicates, MongoDB would allow such an event to happen! Keep this in mind as you implement your solution.

Insert if not Subset of Existing

from pymongo import ReturnDocument
objID = db.collection.find_one_and_update(
    myDoc,
    {"$unset": {"<<<IHopeThisIsNeverInTheDB>>>": ""}},  #There is no NOOP...
    {},  #We only want the "_id".
    return_document=ReturnDocument.AFTER,  #IIRC an upsert would return a null without this.
    upsert=True,
)["_id"]

Using the faux NOOP, I managed to convert the update call to a find call with an upsert feature, successfully implementing a "insert if new" in a single MongoDB call. This roughly translates to the MongoDB client operation:

db.collection.findAndModify({
    query: <your doc>,
    update: {$unset: {"<<<IHopeThisIsNeverInTheDatabase>>>": ""}},  // There is no NOOP...
    new: true,  // IIRC an upsert would return a null without this.
    fields: {},  // Only want the ObjectId
    upsert: true,  // Create if no matches.
})

A problem/feature with this code is that it will match documents that contain a superset of data from <your doc>, not only an exact match. For example, consider a collection:

{"foo": "bar", "apples": "oranges"}

The above code will match the one document already in the collection to any of the following documents being uploaded:

{"foo": "bar"}
{"apples": "oranges"}
{"foo": "bar", "apples", "oranges"}

Therefore, it is not a true "insert if new" because it fails to ignore superset documents, but for some applications this may be good enough and will be very fast compared to the brute force approach.

Insert if Subdocuments are not Exact Match

If it is good enough to only match subdocuments:

q = {k: {"$eq": v} for k, v in myDoc.items()}  #Insert "$eq" operator on root's subdocuments to require exact matches.
objID = db.collection.find_one_and_update(
    q,
    {"$unset": {"<<<IHopeThisIsNeverInTheDB>>>": ""}},  #There is no NOOP...
    {},  #We only want the "_id".
    return_document=ReturnDocument.AFTER,  #IIRC an upsert would return a null without this.
    upsert=True,
)["_id"]

Note that $eq is order-dependent, so if you're dealing with data that is not order-dependent (e.g. Python dict objects), this approach will not work.

Insert if Entire Document is not Exact Match

There are 4 approaches I can think of for this, with the last one being my recommended approach.

Upsert-Optimized Find and Insert

You can expand the previous approach with root checking, adding client-side logic to check the root document and insert if there were no complete matches:

q = {k: {"$eq": v} for k, v in myDoc.items()}  #Insert "$eq" operator on root's subdocuments to require exact matches.
resp = collection.update_many(
    q,
    {"$unset": {"<<<IHopeThisIsNeverInTheDB>>>": ""}},  #There is no NOOP...
    True,
)
objID = resp.upserted_id
if objID is None:
    #No upsert occurred.  If you must, use a find to get the direct match:
    docs = collection.find(q, {k: 0 for k in myDoc.keys()}, limit=resp.matched_count)
    for doc in docs:
        if len(doc) == 1:  #Only match documents that have the "_id" field and nothing else.
            objID = doc["_id"]
            break
    else:  #No direct matches were found.
        objID = collection.insert_one(myDoc, {}).inserted_id

Note the use of filtering known fields from the results of find to cut down data usage and simplify our equivalence checking. I also toss in the resp.matched_count for query limit so we don't waste time looking up documents that we know don't already exist.

Note that this approach is optimized for upsert (2 insert calls in a single insert function...yuk!!!!) where you're creating documents more often than you're finding existing ones. In most "insert if new" situations I've encountered, the more common event is that the document already exists in which case you want to do a "find first & insert if missing" approach. This leads to the other options.

Order Dependent Find and Insert

Do the $eq-style query to match the subdocuments, then use client-side code to check the root and insert if no matches:

q = {k: {"$eq": v} for k, v in myDoc.items()}  #Insert "$eq" operator on root's subdocuments to require exact matches.
docs = collection.find(q, {k: 0 for k in myDoc.keys()})  #Filter known fields so we isolate the mismatches.
for doc in docs:
    if len(doc) == 1:  #Only match documents that have the "_id" field and nothing else.
        objID = doc["_id"]
        break
else:  #No direct matches were found.
    objID = collection.insert_one(myDoc, {}).inserted_id

Again $eq is order-dependent which could cause problems depending on your situation.

Unordered Find and Insert

If you want to go order-independent, you can construct your query by simply flattening the JSON document. This bloats your query with duplicate parents in the map tree, but this could be okay depending on your use case.

myDoc = {"llama": {"duck": "cake", "ate": "rake"}}
q = {"llama.duck": "cake", "llama.ate": "rake"}
docs = collection.find(q, {k: 0 for k in q.keys()})  #Filter known fields so we isolate the mismatches.
for doc in docs:
    if len(doc) == 1:  #Only match documents that have the "_id" field and nothing else.
        objID = doc["_id"]
        break
else:  #No direct matches were found.
    objID = collection.insert_one(myDoc, {}).inserted_id

There's likely a way to do this all server-side using JavaScript. Unfortunately, my JavaScript-fu is lacking at the moment.

Hash as Unique Index (RECOMMENDED)

Make the unique index requirement work for you, creating that index on a hash of the document's information as suggested in this answer for another SO question: https://stackoverflow.com/a/27993841/2201287 . Ideally, this hash can be generated from the data alone, allowing you to create the hash without ever talking to MongoDB. The author of the linked answer does a SHA-256 hash on the string representation of the JSON document. For this project I was already using xxHash and thus opted for a xxHash on the bson.json_util.dumps(myDoc) output where myDoc is the dict, collections.OrderedDict, or bson.son.SON object that I want to upload. Since I'm in Python with duck-typing and all that jazz, using json_util gives me the post-conversion state of the SON document and thus ensures that the hash generation is platform-agnostic in case I want to generate these hashes in another program/language. Note that hashes are generally order-dependent, so using unordered structures like Python's dict will cause different hashes for duplicate data. In the event the user hands me a dict, I wrote a simple utility function that recursively converts dict objects to bson.son.SON objects with keys sorted via Python's sorted function.

Once you have a hash or other unique value that represents your data and have created a unique index in MongoDB for that key, you can use the simple upsert approach to accomplish your "insert if new" function.

from pymongo import ReturnDocument
myDoc["xxHash"] = xxHashValue  #32-bit signed integer generated from xxHash of "bson.json_util.dumps(myDoc)"
objID = db.collection.find_one_and_update(
    myDoc,
    {"$unset": {"<<<IHopeThisIsNeverInTheDB>>>": ""}},  #There is no NOOP...
    {},  #We only want the "_id".
    return_document=ReturnDocument.AFTER,  #IIRC an upsert would return a null without this.
    upsert=True,
)["_id"]

All the DB work happens in one short command and is blazingly fast with indexing. The hard part is just generating the hash.

So there you have a number of approaches that may fit your particular situation. Of course, if MongoDB had just supported root-level equivalence testing this would be a lot easier, but the hash approach is a great alternative and likely delivers the best speed overall.

like image 189
John Crawford Avatar answered Nov 15 '22 04:11

John Crawford


Look at MongoDB's findAndModify method.

It may fit almost all your criteria.

  1. On a single document, it is atomic.
  2. It has an upsert option.
  3. By default it returns the pre-modified document.
  4. Can also remove the document if required.
like image 45
Ewan Avatar answered Nov 15 '22 06:11

Ewan