I have a mongo script that I am using to perform some data cleanup after a database migration.
When I run this script locally it finishes in about 5 minutes. When I run the script from my local machine against a remote instance it takes forever (I usually kill it after about two hours). The two databases are essentially identical: the indexes are all the same, and at most a few records exist in one but not the other.
I am executing the script like so:
Locally-
mongo localDatabase script.js
Against remote instance-
mongo remoteServer/remoteDatabase -u user -p password script.js
I had assumed that since I was passing the script to the remote instance it would be executed entirely on the remote machine with no data having to be transported back and forth between the remote machine and my local machine (and hence there would be little difference in performance).
Is this assumption correct? Any idea why I am seeing the huge performance difference between local/remote? Suggestions on how to fix?
Yes, you can use Bulk operations. All operations in MongoDB are designed around a single collection, but there is nothing wrong with looping over one collection and inserting into or updating another collection.
In fact, in the MongoDB 2.6 shell it is the best way to do it. The collection methods themselves try to use the "Bulk" methods under the hood, even though they only perform a single update or insert per operation. That is why you would see the different response in the shell.
Note that your server needs to be a MongoDB 2.6 or greater instance as well, which is why the collection methods in the shell do some detection in case you are connecting to an older server.
But basically your process is:
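    var bulk = db.targetcollection.initializeOrderedBulkOp();
    var counter = 0;

    db.sourcecollection.find().forEach(function(doc) {
        // Queue one update against the target for each source document
        bulk.find({ "_id": doc._id }).updateOne(
            // update operations here
        );
        counter++;

        // Send the queued operations every 1000 documents, then start a new queue
        if ( counter % 1000 == 0 ) {
            bulk.execute();
            bulk = db.targetcollection.initializeOrderedBulkOp();
        }
    });

    // Send whatever is left over from the last partial batch
    if ( counter % 1000 != 0 )
        bulk.execute();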
The Bulk API keeps every operation you send to it "queued" until execute() is called, which sends the queued operations to the server. However many operations are queued, they are only actually sent in batches of 1000 entries at a time. A little extra care is taken here to execute manually on a modulo of 1000, so the queue never holds more than one batch and does not use up additional memory.
You can tune that amount to your needs, but remember there is a hard limit of 16MB, as each batch basically translates to a BSON document sent as a request.
See the full manual page for all options, including upserts, multi-updates, inserts and removes, or even unordered operations, for cases where neither the order of execution nor halting on an individual error is important.
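As a rough illustration of the unordered and upsert options, here is a minimal sketch rather than a definitive implementation. The function name `upsertInBatches` and its parameters `source`, `target` and `batchSize` are hypothetical. In the shell you would pass something like `db.sourcecollection` and `db.targetcollection`. It assumes you want to replace each target document wholesale with its source document.

```javascript
// Sketch: copy documents from `source` into `target` using unordered upserts,
// flushing the queue every `batchSize` operations.
// `source`, `target` and `batchSize` are hypothetical parameters.
function upsertInBatches(source, target, batchSize) {
    var bulk = target.initializeUnorderedBulkOp();
    var queued = 0;
    var batches = 0;

    source.find().forEach(function (doc) {
        // upsert(): insert the document when nothing matches this _id
        bulk.find({ "_id": doc._id }).upsert().replaceOne(doc);
        queued++;

        if (queued % batchSize === 0) {
            bulk.execute();                             // send the whole batch
            batches++;
            bulk = target.initializeUnorderedBulkOp();  // start a fresh queue
        }
    });

    // Flush the remainder; skip this when nothing is queued
    if (queued % batchSize !== 0) {
        bulk.execute();
        batches++;
    }
    return batches;
}
```

Called as `upsertInBatches(db.sourcecollection, db.targetcollection, 1000)`, it returns the number of batches it executed. Unordered execution lets the server continue past an individual failure, so only use it when the operations do not depend on one another.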
Also note that in the unordered case the write result returns any failed items as a list, and the response also contains things such as the list of upserted documents where upserts were applied.
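To make that concrete, here is a small sketch of pulling the counts, upserted ids and errors out of a bulk result. The helper name `summarizeBulkResult` is hypothetical. It assumes an object exposing `nMatched`, `nModified`, `nUpserted`, `getUpsertedIds()` and `getWriteErrors()`, as the shell's bulk write result does. Depending on the shell version, a batch with failures may surface as a thrown error rather than a return value, so treat the calling convention as an assumption.

```javascript
// Sketch: condense a bulk write result into a plain object.
// `result` is assumed to look like the shell's bulk write result.
function summarizeBulkResult(result) {
    return {
        matched: result.nMatched,
        modified: result.nModified,
        upserted: result.nUpserted,
        // each upsert entry carries the operation index and the new _id
        upsertedIds: result.getUpsertedIds().map(function (u) {
            return u._id;
        }),
        // each write error carries the index of the failed operation
        errors: result.getWriteErrors().map(function (e) {
            return { index: e.index, code: e.code, message: e.errmsg };
        })
    };
}
```

In a script this could be used as `printjson(summarizeBulkResult(bulk.execute()))` after each batch, which makes it easier to see which operations failed without stopping the whole run.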
Combined with keeping your shell instance as close as possible to the server, the reduced "back and forth" traffic will speed things up. As I said, the shell is using these anyway, so you might as well leverage these to your advantage.
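A back-of-the-envelope sketch shows why the round trips dominate. The function `estimateNetworkSeconds` and the numbers passed to it are purely illustrative assumptions, not measurements from the question. It counts only the write requests and ignores server processing time and the reads done by the cursor.

```javascript
// Sketch: estimate time spent purely on network round trips for writes.
// All three parameters are hypothetical inputs chosen by the caller.
function estimateNetworkSeconds(docCount, batchSize, roundTripMs) {
    // one request per document
    var perDocument = docCount * roundTripMs / 1000;
    // one request per batch of `batchSize` documents
    var batched = Math.ceil(docCount / batchSize) * roundTripMs / 1000;
    return { perDocument: perDocument, batched: batched };
}

// Hypothetical figures: one million documents, batches of 1000, 50 ms round trip
var estimate = estimateNetworkSeconds(1000000, 1000, 50);
// estimate.perDocument is 50000 seconds (nearly 14 hours)
// estimate.batched is 50 seconds
```

Against a local server the round trip is close to zero, so both figures collapse to almost nothing, which is consistent with the script being fast locally and very slow remotely.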