We have a setup with two MongoDB shards. Each shard contains a master, a slave, a slave with a 24h slaveDelay, and an arbiter. However, the balancer fails to migrate any chunks because it keeps waiting for the delayed slave. I have tried setting _secondaryThrottle to false in the balancer config, but I still have the issue.
It seems the migration goes on for a day and then fails (a ton of "waiting for slave" messages in the logs). Eventually it gives up and starts a new migration. The message says it is waiting for 3 slaves, but the delayed slave is hidden and has priority 0, so it should not have to wait for that one. And if _secondaryThrottle worked, it should not wait for any slave at all, right?
It has been like this for a few months now, so the config should have been reloaded on all mongoses. Some of the mongoses running the balancer have been restarted recently.
Does anyone have any idea how to solve this? We did not have these issues before adding the delayed slave, but that is just our theory.
Config:
{ "_id" : "balancer", "_secondaryThrottle" : false, "stopped" : false }
Log from shard1 master process:
[migrateThread] warning: migrate commit waiting for 3 slaves for 'xxx.xxx' { shardkey: ObjectId('4fd2025ae087c37d32039a9e') } -> { shardkey: ObjectId('4fd2035ae087c37f04014a79') } waiting for: 529dc9d9:7a
[migrateThread] Waiting for replication to catch up before entering critical section
Log from shard2 master process:
Tue Dec 3 14:52:25.302 [conn1369472] moveChunk data transfer progress: { active: true, ns: "xxx.xxx", from: "shard2/mongo2:27018,mongob2:27018", min: { shardkey: ObjectId('4fd2025ae087c37d32039a9e') }, max: { shardkey: ObjectId('4fd2035ae087c37f04014a79') }, shardKeyPattern: { shardkey: 1.0 }, state: "catchup", counts: { cloned: 22773, clonedBytes: 36323458, catchup: 0, steady: 0 }, ok: 1.0 } my mem used: 0
Update: I confirmed that removing slaveDelay got the balancer working again. As soon as the delayed slaves caught up, chunks started moving, so the problem does seem to be related to slaveDelay. I also confirmed that the balancer runs with "_secondaryThrottle" : false, yet it still appears to wait for slaves.
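For completeness, removing the delay was just a standard replica set reconfig on each shard's master. A minimal sketch, assuming the delayed member happens to be members[2] (adjust the index to your own config):

// run on the shard's master
cfg = rs.conf()
cfg.members[2].slaveDelay = 0   // clear the 24h delay
rs.reconfig(cfg)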
Shard2:
Tue Dec 10 11:44:25.423 [migrateThread] warning: migrate commit waiting for 3 slaves for 'xxx.xxx' { shardkey: ObjectId('4ff1213ee087c3516b2f703f') } -> { shardkey: ObjectId('4ff12a5eddf2b32dff1e7bea') } waiting for: 52a6f089:81
Tue Dec 10 11:44:26.423 [migrateThread] Waiting for replication to catch up before entering critical section
Tue Dec 10 11:44:27.423 [migrateThread] Waiting for replication to catch up before entering critical section
Tue Dec 10 11:44:28.423 [migrateThread] Waiting for replication to catch up before entering critical section
Tue Dec 10 11:44:29.424 [migrateThread] Waiting for replication to catch up before entering critical section
Tue Dec 10 11:44:30.424 [migrateThread] Waiting for replication to catch up before entering critical section
Tue Dec 10 11:44:31.424 [migrateThread] Waiting for replication to catch up before entering critical section
Tue Dec 10 11:44:31.424 [migrateThread] migrate commit succeeded flushing to secondaries for 'xxx.xxx' { shardkey: ObjectId('4ff1213ee087c3516b2f703f') } -> { shardkey: ObjectId('4ff12a5eddf2b32dff1e7bea') }
Tue Dec 10 11:44:31.425 [migrateThread] migrate commit flushed to journal for 'xxx.xxx' { shardkey: ObjectId('4ff1213ee087c3516b2f703f') } -> { shardkey: ObjectId('4ff12a5eddf2b32dff1e7bea') }
Tue Dec 10 11:44:31.647 [migrateThread] migrate commit succeeded flushing to secondaries for 'xxx.xxx' { shardkey: ObjectId('4ff1213ee087c3516b2f703f') } -> { shardkey: ObjectId('4ff12a5eddf2b32dff1e7bea') }
Tue Dec 10 11:44:31.667 [migrateThread] migrate commit flushed to journal for 'xxx.xxx' { shardkey: ObjectId('4ff1213ee087c3516b2f703f') } -> { shardkey: ObjectId('4ff12a5eddf2b32dff1e7bea') }
The balancer is properly waiting for the MAJORITY of the destination shard's replica set to have received the documents being migrated before initiating the delete of those documents on the source shard.
The issue is that you have FOUR members in your replica set (a master, a slave, a 24h-delayed slave, and an arbiter), which means three is the majority. I'm not sure why you added an arbiter, but if you remove it, then TWO will be the majority and the balancer will not have to wait for the delayed slave.
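If you take that route, dropping the arbiter is a single step on the shard's master. A sketch, with a placeholder host:port for your arbiter:

// run on the shard's master; leaves 3 data-bearing members, so the majority becomes 2
rs.remove("arbiter1:27018")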
The alternative way of achieving the same result is to set the delayed slave up with votes: 0 and leave the arbiter as the third voting node.
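A sketch of that reconfig, again assuming the delayed slave is members[2] in your replica set config:

// run on the shard's master
cfg = rs.conf()
cfg.members[2].votes = 0   // delayed slave keeps replicating but no longer counts toward the majority, per the approach above
rs.reconfig(cfg)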