
moveChunk failed to engage TO-shard in the data transfer: can't accept new chunks because

I have a MongoDB production cluster running 2.6.5 that I recently migrated from two shards to three. I had been running as two shards for about a year. Each shard is a 3-server replica set and I have one collection sharded.
The sharded collection is about 240G, and with the new shard the chunks are now evenly distributed at 2922 per shard. My production environment appears to be performing just fine, and there is no problem accessing data.

[Note: 1461 should be the number of chunks moved from each of rs0 and shard1 to make up the 2922 on shard2.]
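For anyone wanting to double-check similar numbers, a per-shard chunk count for the collection can be pulled from the config database through a mongos. This is just a sketch; "mydb.mycoll" is a placeholder for the actual sharded namespace:

// run from a mongos; "mydb.mycoll" stands in for the sharded collection
use config
db.chunks.aggregate([
   { $match : { ns : "mydb.mycoll" } },
   { $group : { _id : "$shard", chunks : { $sum : 1 } } }
])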

My intent was to shard three more collections, so I started with one and expected it to spread across the shards. But no - I ended up with this repeating error:

2014-10-29T20:26:35.374+0000 [Balancer] moveChunk result: { cause: { ok: 0.0, errmsg: "can't accept new chunks because there are still 1461 deletes from previous migration" }, ok: 0.0, errmsg: "moveChunk failed to engage TO-shard in the data transfer: can't accept new chunks because there are still 1461 deletes from previous migration" }

2014-10-29T20:26:35.375+0000 [Balancer] balancer move failed: { cause: { ok: 0.0, errmsg: "can't accept new chunks because there are still 1461 deletes from previous migration" }, ok: 0.0, errmsg: "moveChunk failed to engage TO-shard in the data transfer: can't accept new chunks because there are still 1461 deletes from previous migration" } from: rs0 to: shard1 chunk: min: { account_id: MinKey } max: { account_id: -9218254227106808901 }

With a little research I figured I should just give it some time, since it obviously needs to clean things up after the move. I ran sh.disableBalancing("collection-name") to stop the errors from attempting to shard the new collection. sh.getBalancerState() shows true, as does sh.isBalancerRunning(). However, I gave it 24 hours and the error message is the same. I would have thought it would have cleaned up/deleted at least 1 of the 1461 it needs to delete.
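For reference, the balancer-related commands above look roughly like this from a mongos shell; "mydb.mycoll" is a placeholder for the real namespace:

sh.disableBalancing("mydb.mycoll")   // stop the balancer from migrating chunks of this collection
sh.getBalancerState()                // true - the balancer is still enabled cluster-wide
sh.isBalancerRunning()               // true - a balancing round is currently in progress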

  1. Is this common behavior now in the 2.6 world? Am I going to have to manhandle all my sharded collections every time I grow the environment by another shard?
  2. Any idea how to get this cleanup going? Or should I just step down the primary on shard1, which seems to be where the issue is?
  3. If I do step down the primary, will I still have documents to 'delete/cleanup' on the secondary anyway? Or will this take care of things so I can start sharding some new collections?

Thanks in advance for any insights.

Jeff Goddard, asked Oct 29 '14

1 Answer

It's not common to see this kind of issue, but I have seen it occur sporadically.

The best remedial action to take here is to step down the primary of the referenced TO shard, which will clear out the background deletes. The delete threads only exist on the current primary (the deletes are replicated from that primary via the oplog as they are processed). When you step it down, it becomes a secondary, the threads can no longer write, and you get a new primary with no pending deletes. You may wish to restart the former primary after the step-down to clear out old cursors, but it's not usually urgent.
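As a rough sketch, the step-down itself is a single command run while connected directly to the current primary of the TO shard; the 60 seconds below is just an example value, long enough to stop the same node being immediately re-elected:

rs.stepDown(60)   // the primary steps down and will not seek re-election for 60 seconds
rs.status()       // confirm that the replica set has elected a new primary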

Once you do this, you will be left with a large number of orphaned documents, which can be addressed with the cleanupOrphaned command, which I would recommend running at low-traffic times (if you have such times).
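A sketch of that, run against the admin database of the shard's primary (connected directly, not through a mongos); "mydb.mycoll" is a placeholder for the sharded namespace, and the loop simply keeps going until the whole key range has been examined:

var nextKey = { };
var result;
while ( nextKey != null ) {
   result = db.adminCommand( { cleanupOrphaned: "mydb.mycoll", startingFromKey: nextKey } );
   if ( result.ok != 1 ) {
      print( "cleanupOrphaned did not complete: " + tojson(result) );
      break;
   }
   nextKey = result.stoppedAtKey;   // null once the final range has been cleaned up
}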

For reference, if this is a recurring problem, then it is likely the primaries are struggling a little under the load, and to avoid deletes queuing up you can set the _waitForDelete option for the balancer to true (it is false by default) as follows:

// run against the config database from a mongos
use config
// make each migration wait for its delete phase to finish before the next one starts
db.settings.update(
   { "_id" : "balancer" },
   { $set : { "_waitForDelete" : true } },
   { upsert : true }
)

This will mean that each migration is slower (perhaps significantly so) but will not cause the background deletes to accumulate.
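If you want to confirm the setting afterwards, reading back the same settings document will show it:

use config
db.settings.find( { "_id" : "balancer" } )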

Adam Comerford