I'm trying to downsize a sharded cluster which currently has 8 shards, to a cluster with 4 shards.
I've started with the 8th shard and tried removing it first.
db.adminCommand( { removeShard : "rs8" } );
----
{
"msg" : "draining ongoing",
"state" : "ongoing",
"remaining" : {
"chunks" : NumberLong(1575),
"dbs" : NumberLong(0)
},
"note" : "you need to drop or movePrimary these databases",
"dbsToMove" : [ ],
"ok" : 1
}
So there are 1575 chunks to be migrated to the rest of the cluster.
But running sh.isBalancerRunning()
returns false,
and the output of sh.status()
looks like the following:
...
...
active mongoses:
"3.4.10" : 16
autosplit:
Currently enabled: yes
balancer:
Currently enabled: yes
Currently running: no
Failed balancer rounds in last 5 attempts: 0
Migration Results for the last 24 hours:
59 : Success
1 : Failed with error 'aborted', from rs8 to rs1
1 : Failed with error 'aborted', from rs2 to rs6
1 : Failed with error 'aborted', from rs8 to rs5
4929 : Failed with error 'aborted', from rs2 to rs7
1 : Failed with error 'aborted', from rs8 to rs2
506 : Failed with error 'aborted', from rs8 to rs7
1 : Failed with error 'aborted', from rs2 to rs3
...
So the balancer is enabled but not running. Since there's a draining shard (rs8) being removed, I'd expect the balancer to be running pretty much constantly, right? But as the output above shows, it isn't.
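For reference, the balancer and draining state can also be checked directly on a mongos; as far as I know, the config.shards document for rs8 should carry a draining flag while removeShard is in progress:

// Shell helpers for the balancer state
sh.getBalancerState();     // true  -> balancer is enabled
sh.isBalancerRunning();    // false -> no balancing round in progress right now

// The shard document in the config DB should show "draining" : true while removeShard is ongoing
db.getSiblingDB("config").shards.find({ _id: "rs8" });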
Also, the process is taking incredibly long: over nearly the past day, the number of remaining chunks has decreased by only 10, from 1575 to 1565! At this rate it's going to take months to reduce a sharded cluster of 8 shards to a cluster of 4!
It also seems MongoDB itself doesn't stop writes to the draining shard, so maybe what I'm experiencing is that the rate at which new chunks are created there is nearly canceling out the rate at which they're drained?
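One way to keep an eye on this is to count chunks per shard straight from the config database (through a mongos); if the drain is making progress, rs8's count should keep dropping. Here "mydb.col" is just a placeholder for the real namespace:

// Chunk count per shard for the sharded collection ("mydb.col" is a placeholder)
db.getSiblingDB("config").chunks.aggregate([
    { $match: { ns: "mydb.col" } },
    { $group: { _id: "$shard", chunks: { $sum: 1 } } },
    { $sort: { chunks: -1 } }
]);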
Any help is greatly appreciated!
Thanks
EDIT
Great, now after exactly a month, the process is over and I have a 4-shard cluster! The trick I describe below helped reduce the time it would otherwise have taken, but honestly, it's still the slowest thing I've ever done.
OK, so answering my own question here:
I couldn't get the automatic balancing to go as fast as I wanted; each day I observed only about 5 to 7 chunks being migrated (meaning the whole process would take years!).
What I did to kind of overcome this issue was to use the moveChunk command manually.
So what I basically did was:
while (canStillSample) {                     // pseudocode condition: stop when nothing is left to sample
    // Sample the 8th shard for 100 documents
    var docs = db.col.aggregate([{ $sample: { size: 100 } }]).toArray();
    // For every document, move the chunk that owns its shard key value
    docs.forEach(function (doc) {
        sh.moveChunk(namespace, { shardKey: doc.shardKey }, "rs" + NUM);
    });
}
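And here's a fuller sketch of that loop, filled in with my own placeholder assumptions: the namespace mydb.col, a shard key field literally named shardKey, and the four shards I want to keep as destinations, rotated round-robin. Re-issuing removeShard is harmless and just reports draining progress, so I use it as the stop condition. Note that $sample through a mongos returns documents from every shard, not only rs8, so some moveChunk calls will target chunks that are already on the destination (or not on rs8 at all); the result check just tolerates those.

// Sketch only: "mydb.col" and the field name "shardKey" are placeholders
var ns = "mydb.col";
var coll = db.getSiblingDB("mydb").col;
var keep = ["rs1", "rs2", "rs3", "rs4"];
var next = 0;

function stillDraining() {
    // removeShard can be re-issued safely; it just reports the draining status
    return db.adminCommand({ removeShard: "rs8" }).state !== "completed";
}

while (stillDraining()) {
    coll.aggregate([{ $sample: { size: 100 } }]).forEach(function (doc) {
        var to = keep[next++ % keep.length];
        var res = sh.moveChunk(ns, { shardKey: doc.shardKey }, to);
        if (res.ok !== 1) {
            // e.g. the chunk already lives on "to", or another migration is in flight
            print("moveChunk to " + to + ": " + res.errmsg);
        }
    });
}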
So I'm manually moving chunks out of the 8th shard to the first 4 shards. One downside: since the balancer has to stay enabled and only one shard can be draining at a time, some of those migrated chunks later get rebalanced automatically onto shards 5-7, which I also want to remove, so the overall process takes longer. Any solutions for that?
Since the 8th shard is draining, the balancer won't move chunks back onto it, and now the whole process is much faster: about 350-400 chunks per day. So hopefully each shard will take about 5 days at most, and the whole resize should take about 20 days!
That's the fastest I could make it. I'd appreciate any other answers or strategies for doing this downsize better.