I have a MongoDB production cluster running 2.6.5 that I recently migrated from two to three shards. I had been running with two shards for about a year. Each shard is a 3-server replica set, and I have one collection sharded.
The sharded collection is about 240 GB, and with the new shard I now have an even distribution of 2922 chunks on each shard. My production environment appears to be performing just fine, and there is no problem accessing data.
[Note: the 1461 mentioned in the error below should be the number of chunks moved from rs0 and shard1 to bring shard2 up to 2922.]
My intent was to shard three more collections, so I started with one and expected it to spread across the shards. But no - I ended up with this repeating error:
2014-10-29T20:26:35.374+0000 [Balancer] moveChunk result: { cause: { ok: 0.0, errmsg: "can't accept new chunks because there are still 1461 deletes from previous migration" },
ok: 0.0, errmsg: "moveChunk failed to engage TO-shard in the data transfer: can't accept new chunks because there are still 1461 deletes from previous migration" }
2014-10-29T20:26:35.375+0000 [Balancer] balancer move failed: { cause: { ok: 0.0, errmsg: "can't accept new chunks because there are still 1461 deletes from previous migration" },
ok: 0.0, errmsg: "moveChunk failed to engage TO-shard in the data transfer: can't accept new chunks because there are still 1461 deletes from previous migration" } from: rs0 to: shard1 chunk: min: { account_id: MinKey } max: { account_id: -9218254227106808901 }
With a little research I figured I should just give it some time, since it obviously needs to clean things up after the move. I ran sh.disableBalancing("collection-name") to stop the errors from the attempts to shard the new collection. sh.getBalancerState() returns true, as does sh.isBalancerRunning(). However, I gave it 24 hours and the error message is unchanged. I would have expected it to clean up/delete at least 1 of the 1461 it needs to delete.
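For reference, this is roughly what I ran from a mongo shell connected to a mongos (the namespace is just a placeholder for my collection):

sh.disableBalancing("mydb.mycollection")   // stop balancing for this collection only
sh.getBalancerState()                      // true: balancing is enabled cluster-wide
sh.isBalancerRunning()                     // true: a balancing round is currently in progress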
Thanks in advance for any insights.
It's not common to see this kind of issue, but I have seen it occur sporadically.
The best remedial action to take here is to step down the primary of the referenced TO shard, which will clear out the background deletes. The delete threads only exist on the current primary (the deletes are replicated from that primary via the oplog as they are processed). When you step it down, it becomes a secondary, the threads can no longer write, and you get a new primary with no pending deletes. You may wish to restart the former primary after the step down to clear out old cursors, but it's not usually urgent.
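As a rough sketch (the host name is a placeholder), the step down is done from a shell connected directly to the current primary of the TO shard, not through a mongos:

// mongo --host shard2-primary.example.net --port 27017   <- placeholder for the TO shard's primary
rs.status()        // confirm which member is currently primary
rs.stepDown(120)   // step down and do not seek re-election for 120 seconds (60 is the default)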
Once you do this, you will be left with a large number of orphaned documents; these can be addressed with the cleanupOrphaned command, which I would recommend running at low-traffic times (if you have such times).
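A minimal sketch of that cleanup, assuming a sharded namespace of mydb.mycollection (a placeholder) and a shell connected directly to each shard's primary; cleanupOrphaned is typically looped using the stoppedAtKey it returns until there are no more ranges to clean:

var nextKey = { };
var result;
while (nextKey != null) {
    // admin command run against the shard primary, not a mongos
    result = db.adminCommand({ cleanupOrphaned: "mydb.mycollection", startingFromKey: nextKey });
    if (result.ok != 1) {
        print("cleanupOrphaned stopped: " + tojson(result));
        break;
    }
    printjson(result);
    nextKey = result.stoppedAtKey;   // null once every orphaned range has been removed
}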
For reference, if this is a recurring problem, the primaries are likely struggling a little under load. To avoid deletes queuing up, you can set the balancer's _waitForDelete option to true (the default is false) as follows:
use config
db.settings.update(
   { "_id" : "balancer" },
   { $set : { "_waitForDelete" : true } },
   { upsert : true }
)
This will mean that each migration is slower (perhaps significantly so) but will not cause the background deletes to accumulate.