My use case is as follows: I have a collection of documents in MongoDB that I have to send for analysis. The format of the documents is as follows:
{ _id:ObjectId("517e769164702dacea7c40d8") , date:"1359911127494", status:"available", other_fields... }
I have a reader process which picks the first 100 documents with status:available, sorted by date, and sets them to status:processing. The reader process then sends the documents for analysis. Once the analysis is complete, the status is changed to processed.
Currently the reader process first fetches 100 documents sorted by date and then updates the status to processing for each document in a loop. Is there a better or more efficient solution for this case?
Also, for future scalability, we might run more than one reader process. In that case, the 100 documents picked by one reader process should not be picked by another reader process. But fetching and updating are separate queries right now, so it is quite possible that multiple reader processes pick the same documents.
Bulk findAndModify (with limit) would have solved all these problems. But unfortunately it is not provided in MongoDB yet. Is there any solution to this problem?
In MongoDB, the Bulk.insert() method is used to perform insert operations in bulk, that is, to insert multiple documents in one go.
MongoDB findAndModify() Method. The findAndModify() method modifies and returns a single document that matches the given criteria. By default, it returns the document as it was before the modification. To return the modified document instead, set the new option to true.
Update Multiple Fields of Multiple Documents. In addition, we can also update multiple fields of more than one document in MongoDB. We simply need to include the option multi:true to modify all documents that match the filter query criteria: db.
The MongoDB findAndModify() method modifies and returns a single document based on the selection criteria. By default, the returned document does not show the updated content. If no document matches the criteria, a new document is inserted when upsert is set to true.
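Since findAndModify() touches one document per call, one way to claim a batch safely is to call it in a loop. The following is only a sketch for the mongo shell, assuming the `status` and `date` fields from the question; the function name `claimOneByOne`, the `readerId` field and the collection name `db.documents` are made up for illustration. It costs one round trip per document, so it trades speed for simplicity.

```javascript
// Sketch for the mongo shell / mongosh. `coll` is a collection handle such as
// db.documents (hypothetical name); `readerId` is any value unique per reader.
function claimOneByOne(coll, readerId, batchSize) {
  var claimed = [];
  while (claimed.length < batchSize) {
    // Each call atomically flips a single "available" document to "processing",
    // so two readers can never receive the same document from this call.
    var doc = coll.findAndModify({
      query: { status: "available" },
      sort: { date: 1 },
      update: { $set: { status: "processing", readerId: readerId } },
      new: true // return the document as it looks after the update
    });
    if (!doc) break; // nothing left to claim
    claimed.push(doc);
  }
  return claimed;
}
```

Usage would look something like `var batch = claimOneByOne(db.documents, "reader-1", 100);`.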
As you mention, there is currently no clean way to do what you want. The best approach at this time for operations like the one you need is this:
e.g. update({_id:{$in:[<result set ids>]}, state:"available", $isolated:1}, {$set:{readerId:<your reader's ID>, state:"processing"}}, false, true)
Note that this even works in highly concurrent situations, as a reader can never reserve documents already reserved by another reader (step 2 can only reserve currently available documents, and writes to a single document are atomic). I would also add a timestamp with the reservation time if you want to be able to time out reservations (for example, for scenarios where readers might crash or fail).
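A rough sketch of how the reservation could look in the mongo shell, written against the question's `status` field rather than `state`. The select-first and read-back steps are my reading of the flow around the update above, and the names `reserveBatch`, `releaseStale`, `reservedAt` and `db.documents` are hypothetical. It uses `updateMany` without `$isolated`, which as far as I know was removed in MongoDB 4.0; the read-back by reader ID is what tells each reader which documents it actually won.

```javascript
// Sketch for the mongo shell / mongosh; `coll` is a collection handle
// (e.g. db.documents, a hypothetical name).
function reserveBatch(coll, readerId, batchSize) {
  // Select candidate ids, oldest first.
  var ids = coll.find({ status: "available" }, { _id: 1 })
    .sort({ date: 1 })
    .limit(batchSize)
    .toArray()
    .map(function (d) { return d._id; });
  if (ids.length === 0) return [];

  // Mark only those candidates that are still available. A document that
  // another reader grabbed in the meantime no longer matches the filter.
  coll.updateMany(
    { _id: { $in: ids }, status: "available" },
    { $set: { status: "processing", readerId: readerId, reservedAt: new Date() } }
  );

  // Read back everything this reader currently holds. This also returns
  // documents from earlier batches that it has not finished yet.
  return coll.find({ status: "processing", readerId: readerId })
    .sort({ date: 1 })
    .toArray();
}

// Hypothetical helper for the timeout idea: hand back reservations that are
// older than maxAgeMs, e.g. because the reader holding them crashed.
function releaseStale(coll, maxAgeMs) {
  var cutoff = new Date(Date.now() - maxAgeMs);
  return coll.updateMany(
    { status: "processing", reservedAt: { $lt: cutoff } },
    { $set: { status: "available" }, $unset: { readerId: "", reservedAt: "" } }
  );
}
```

A reader would call something like `reserveBatch(db.documents, "reader-1", 100)`, and a periodic job could run `releaseStale(db.documents, 10 * 60 * 1000)`. A reader may get fewer than 100 documents back when another reader wins some of the same candidates.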
EDIT: More details:
All write operations can occasionally yield for pending operations if the write takes a relatively long time. This means that step 2) might not see all documents marked by step 1) unless you take the following steps:
Also see comments for discussion regarding atomicity/isolation. I incorrectly assumed multi-updates were isolated. They are not, or at least not by default.