Solution to Bulk FindAndModify in MongoDB

Tags: mongodb, nosql

My use case is as follows: I have a collection of documents in MongoDB that I need to send for analysis. The documents have the following format:

{ _id:ObjectId("517e769164702dacea7c40d8") , date:"1359911127494", status:"available", other_fields... }

I have a reader process that picks the first 100 documents with status:available, sorted by date, and sets their status to processing. The reader process then sends the documents for analysis. Once the analysis is complete, the status is changed to processed.

Currently, the reader process first fetches 100 documents sorted by date and then updates the status of each document to processing in a loop. Is there a better or more efficient solution for this case?

Also, for future scalability we might run more than one reader process. In that case, the 100 documents picked by one reader process should not be picked by another. But fetching and updating are separate queries right now, so it is quite possible for multiple reader processes to pick the same documents.

A bulk findAndModify (with a limit) would solve all of these problems, but unfortunately MongoDB does not provide one yet. Is there any solution to this problem?

asked May 02 '13 09:05

ameykpatil

1 Answer

As you mention, there is currently no clean way to do what you want. The best approach at this time for operations like the one you need is this:

1. The reader selects X documents with the appropriate limit and sorting.
2. The reader marks the documents returned by 1) with its own unique reader ID (e.g. update({_id:{$in:[<result set ids>]}, status:"available", $isolated:1}, {$set:{readerId:<your reader's ID>, status:"processing"}}, false, true)).
3. The reader selects all documents marked as processing that carry its own reader ID. At this point it is guaranteed that you have exclusive access to the resulting set of documents.
4. Offer the result set from 3) for your processing.
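
The first three steps can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: it assumes a Node.js driver style `collection` object exposing `find` and `updateMany`, and the names `reserveBatch`, `readerId` and `batchSize` are hypothetical. It uses the `status` field from the question and leaves out `$isolated`, which MongoDB 4.0 and later no longer support.

```javascript
// Sketch only. `collection` is assumed to behave like a Node.js driver
// Collection; `reserveBatch`, `readerId` and `batchSize` are made-up names.
async function reserveBatch(collection, readerId, batchSize) {
  // Step 1: pick candidate documents, oldest first.
  const candidates = await collection
    .find({ status: 'available' })
    .sort({ date: 1 })
    .limit(batchSize)
    .project({ _id: 1 })
    .toArray();
  if (candidates.length === 0) return [];
  const ids = candidates.map((doc) => doc._id);

  // Step 2: claim them. The status condition is checked again for each
  // document, so one that another reader claimed in the meantime is skipped.
  await collection.updateMany(
    { _id: { $in: ids }, status: 'available' },
    { $set: { status: 'processing', readerId } }
  );

  // Step 3: read back only the documents this reader actually owns.
  return collection
    .find({ status: 'processing', readerId })
    .sort({ date: 1 })
    .toArray();
}
```

A reader would then call something like `await reserveBatch(db.collection('documents'), 'reader-1', 100)` (the collection name is an assumption) and may get back fewer than 100 documents if another reader claimed some of the candidates first.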

Note that this works even in highly concurrent situations, because a reader can never reserve documents already reserved by another reader (step 2 can only reserve documents that are currently available, and each single-document write is atomic). I would also add a timestamp with the reservation time if you want to be able to time out reservations (for example, for scenarios where readers might crash or fail).
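
The reservation timeout could look something like the sketch below. It is only an illustration under assumptions: `reservedAt` is a hypothetical field that is not part of the original schema, `claimUpdate` and `releaseStaleReservations` are made-up helper names, and `collection` is again assumed to expose a Node.js driver style `updateMany` that resolves to a result with `modifiedCount`.

```javascript
// Sketch only. `reservedAt` is a hypothetical field name; the helpers
// below are illustrative and not part of any MongoDB API.

// Update document for the claim step, stamping the reservation time.
function claimUpdate(readerId, now = new Date()) {
  return { $set: { status: 'processing', readerId, reservedAt: now } };
}

// Hand documents back to the pool when their reservation is older than
// `timeoutMs`, e.g. because the reader that claimed them crashed.
async function releaseStaleReservations(collection, timeoutMs, now = new Date()) {
  const cutoff = new Date(now.getTime() - timeoutMs);
  const result = await collection.updateMany(
    { status: 'processing', reservedAt: { $lt: cutoff } },
    { $set: { status: 'available' }, $unset: { readerId: '', reservedAt: '' } }
  );
  return result.modifiedCount;
}
```

The timeout needs to be comfortably longer than the slowest expected analysis run; otherwise a document that is still being processed could be released and picked up a second time.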

EDIT: More details:

All write operations can occasionally yield to pending operations if the write takes a relatively long time. This means that step 3) might not see all documents marked by step 2) unless you take the following steps:

- Use an appropriate "w" (write concern) value, meaning 1 or higher. This ensures that the connection on which the write operation is invoked waits for the write to complete, regardless of any yielding.
- Make sure you do the read in step 3) on the same connection (only relevant for replica sets with slaveOk-enabled reads) or thread as the write, so that the two are guaranteed to be sequential. The former can be done in most drivers with the "requestStart" and "requestDone" methods or similar (see the Java driver documentation).
- Add the $isolated flag to your multi-updates to ensure they cannot be interleaved with other write operations.

Also see comments for discussion regarding atomicity/isolation. I incorrectly assumed multi-updates were isolated. They are not, or at least not by default.

answered Oct 12 '22 04:10

Remon van Vliet
