I'm running a Kafka cluster on 3 EC2 instances. Each instance runs Kafka (0.11.0.1) and ZooKeeper (3.4). My topics are configured with 20 partitions each and a replication factor of 3.
Today I noticed that some partitions refuse to sync to all three nodes. Here's an example:
bin/kafka-topics.sh --zookeeper "10.0.0.1:2181,10.0.0.2:2181,10.0.0.3:2181" --describe --topic prod-decline
Topic:prod-decline PartitionCount:20 ReplicationFactor:3 Configs:
Topic: prod-decline Partition: 0 Leader: 2 Replicas: 1,2,0 Isr: 2
Topic: prod-decline Partition: 1 Leader: 2 Replicas: 2,0,1 Isr: 2
Topic: prod-decline Partition: 2 Leader: 0 Replicas: 0,1,2 Isr: 2,0,1
Topic: prod-decline Partition: 3 Leader: 1 Replicas: 1,0,2 Isr: 2,0,1
Topic: prod-decline Partition: 4 Leader: 2 Replicas: 2,1,0 Isr: 2
Topic: prod-decline Partition: 5 Leader: 2 Replicas: 0,2,1 Isr: 2
Topic: prod-decline Partition: 6 Leader: 2 Replicas: 1,2,0 Isr: 2
Topic: prod-decline Partition: 7 Leader: 2 Replicas: 2,0,1 Isr: 2
Topic: prod-decline Partition: 8 Leader: 0 Replicas: 0,1,2 Isr: 2,0,1
Topic: prod-decline Partition: 9 Leader: 1 Replicas: 1,0,2 Isr: 2,0,1
Topic: prod-decline Partition: 10 Leader: 2 Replicas: 2,1,0 Isr: 2
Topic: prod-decline Partition: 11 Leader: 2 Replicas: 0,2,1 Isr: 2
Topic: prod-decline Partition: 12 Leader: 2 Replicas: 1,2,0 Isr: 2
Topic: prod-decline Partition: 13 Leader: 2 Replicas: 2,0,1 Isr: 2
Topic: prod-decline Partition: 14 Leader: 0 Replicas: 0,1,2 Isr: 2,0,1
Topic: prod-decline Partition: 15 Leader: 1 Replicas: 1,0,2 Isr: 2,0,1
Topic: prod-decline Partition: 16 Leader: 2 Replicas: 2,1,0 Isr: 2
Topic: prod-decline Partition: 17 Leader: 2 Replicas: 0,2,1 Isr: 2
Topic: prod-decline Partition: 18 Leader: 2 Replicas: 1,2,0 Isr: 2
Topic: prod-decline Partition: 19 Leader: 2 Replicas: 2,0,1 Isr: 2
Only node 2 has all the data in sync. I've tried restarting brokers 0 and 1, but that didn't improve the situation - it actually made it worse. I'm tempted to restart node 2 as well, but I'm assuming that will lead to downtime or cluster failure, so I'd like to avoid it if possible.
I'm not seeing any obvious errors in the logs, so I'm having a hard time figuring out how to debug the situation. Any tips would be greatly appreciated.
Thanks!
EDIT: Some additional info ... If I check the JMX metrics on node 2 (the one with the full data), it does realize that some partitions are not correctly replicated:
$>get -d kafka.server -b kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions *
#mbean = kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions:
Value = 930;
Nodes 0 and 1 don't. They seem to think everything is fine:
$>get -d kafka.server -b kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions *
#mbean = kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions:
Value = 0;
Is this expected behaviour?
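(Side note: I think the same check can also be made from the command line - assuming kafka-topics.sh in 0.11 supports the --under-replicated-partitions filter, which should print only the partitions whose ISR is smaller than the replica set:)
bin/kafka-topics.sh --zookeeper "10.0.0.1:2181,10.0.0.2:2181,10.0.0.3:2181" --describe --topic prod-decline --under-replicated-partitions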
Try increasing replica.lag.time.max.ms.
The explanation goes like this:
If a replica fails to send a fetch request for longer than replica.lag.time.max.ms, it is considered dead and is removed from the ISR.
If a replica starts lagging behind the leader for longer than replica.lag.time.max.ms, then it is considered too slow and is removed from the ISR. So even if there is a spike in traffic and large batches of messages are written on the leader, the replica will not shuffle in and out of the ISR unless it consistently remains behind the leader for more than replica.lag.time.max.ms.
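For reference, this is a broker-level setting, so a minimal sketch would be to set it in each broker's server.properties and do a rolling restart (the 30000 ms value below is just an illustration - the default is 10000 ms; tune it to your traffic):
# config/server.properties on each broker
# default is 10000 (10 s); 30000 is an illustrative value, not a recommendation
replica.lag.time.max.ms=30000
After the brokers come back up, the UnderReplicatedPartitions metric (or kafka-topics.sh --describe) should show the ISR growing back to all three replicas once the followers catch up.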