MongoDB draining shard but balancer not running? (removeShard taking too much time)

I'm trying to downsize a sharded cluster that currently has 8 shards to a cluster with 4 shards.

I've started with the 8th shard and tried removing it first.

db.adminCommand( { removeShard : "rs8" } );
----
{
    "msg" : "draining ongoing",
    "state" : "ongoing",
    "remaining" : {
        "chunks" : NumberLong(1575),
        "dbs" : NumberLong(0)
    },
    "note" : "you need to drop or movePrimary these databases",
    "dbsToMove" : [ ],
    "ok" : 1
}

So there are 1575 chunks to be migrated to the rest of the cluster.
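For reference, the same remaining-chunk count can be checked straight from the config database (a minimal sketch, run against a mongos; rs8 is the draining shard here):

// Count the chunks still owned by the draining shard
db.getSiblingDB("config").chunks.count({ shard: "rs8" })

// Re-running removeShard reports the same figure under "remaining.chunks"
db.adminCommand({ removeShard: "rs8" })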

But sh.isBalancerRunning() returns false, and the output of sh.status() looks like the following:

  ...
  ...

  active mongoses:
        "3.4.10" : 16
  autosplit:
        Currently enabled: yes
  balancer:
        Currently enabled:  yes
        Currently running:  no
        Failed balancer rounds in last 5 attempts:  0
        Migration Results for the last 24 hours: 
                59 : Success
                1 : Failed with error 'aborted', from rs8 to rs1
                1 : Failed with error 'aborted', from rs2 to rs6
                1 : Failed with error 'aborted', from rs8 to rs5
                4929 : Failed with error 'aborted', from rs2 to rs7
                1 : Failed with error 'aborted', from rs8 to rs2
                506 : Failed with error 'aborted', from rs8 to rs7
                1 : Failed with error 'aborted', from rs2 to rs3
...

So the balancer is enabled but not running. Since there is a draining shard (rs8) being removed, I'd expect the balancer to be running almost constantly, right? It isn't, though, as the sh.status() output above shows.
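For completeness, these are the two checks behind that statement, run from a mongos shell:

sh.getBalancerState()    // true  -> the balancer is enabled
sh.isBalancerRunning()   // false -> no balancing round is in progress right now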

Also, the process is taking incredibly long: over nearly a full day, the number of remaining chunks has decreased by only 10, from 1575 to 1565. At roughly 10 chunks per day, draining this one shard alone would take over five months, and reducing the cluster from 8 shards to 4 would take far longer still!

It also seems that MongoDB doesn't stop routing writes to the draining shard, so what I may be seeing is that new chunks are being created on it almost as fast as existing ones are drained away?

Any help is greatly appreciated!
Thanks

asked Nov 08 '22 by SpiXel

1 Answer

EDIT

Great, now after exactly a month the process is over and I have a 4-shard cluster! The trick I describe below did help reduce the time it would otherwise have taken, but honestly, this is the slowest thing I've ever done.


OK, so answering my own question here:

I couldn't get the automatic balancing behavior to work as fast as I wanted; what I observed was that only about 5 to 7 chunks were migrated per day (meaning the whole process would have taken years!).

What I did to work around this was to run the moveChunk command manually.

So what I basically did was:

while 'can still sample':
    // Sample the 8th shard for 100 documents
    db.col.aggregate([{ $sample: { size: 100 } }])

    // For every sampled document, move the chunk that contains it
    // onto one of the shards I'm keeping (rs1-rs4)
    for every document:
        sh.moveChunk(namespace, { shardKey: value }, `rs${NUM}`);

So I'm manually moving chunks out of the 8th shard onto the first 4 shards. One downside: since the balancer has to stay enabled and only one shard can be draining at a time, some of those migrated chunks are later migrated again automatically onto shards 5-7, which I want to remove too, so the overall process takes even longer. Any solutions for that?
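For anyone who wants something more concrete than the pseudocode above, here is a rough mongo-shell sketch of the same manual moveChunk idea. The namespace "mydb.col" and the list of shards to keep are placeholders for illustration, and it walks config.chunks directly instead of sampling documents:

// Rough sketch: move every chunk of "mydb.col" that still lives on rs8
// onto one of the shards I intend to keep, round-robin.
var keep = ["rs1", "rs2", "rs3", "rs4"];
var i = 0;
db.getSiblingDB("config").chunks
  .find({ ns: "mydb.col", shard: "rs8" })
  .forEach(function (chunk) {
      // chunk.min is the chunk's inclusive lower bound, so it always falls
      // inside the chunk and works as the "find" document for moveChunk
      sh.moveChunk("mydb.col", chunk.min, keep[i++ % keep.length]);
  });

Note that this doesn't fix the downside mentioned above: with the balancer still enabled, some of these chunks can later be moved onto rs5-rs7 anyway.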

Since the 8th shard is draining, the balancer won't move chunks back onto it, and the whole process is now much faster, about 350-400 chunks per day. So hopefully each shard will take about 5 days at most, and the whole downsize should take roughly 20 days.
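As a quick sanity check that rs8 really is the shard being drained (the balancer won't place chunks on a shard flagged this way), the flag is visible on the shard's document in the config database:

// The draining shard carries a "draining": true flag while removeShard is in progress
db.getSiblingDB("config").shards.find({ draining: true })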

That's the fastest I could make it; I'd appreciate any other answers or strategies that handle this kind of downsize better.

answered Nov 15 '22 by SpiXel