We have a cluster of 4 nodes in production. We observed that one of the nodes got into a state where it constantly shrank and expanded the ISR for more than an hour and was unable to recover until the broker was bounced.
[2017-02-21 14:52:16,518] INFO Partition [skynet-large-stage,5] on broker 0: Shrinking ISR for partition [skynet-large-stage,5] from 2,0 to 0 (kafka.cluster.Partition)
[2017-02-21 14:52:16,543] INFO Partition [skynet-large-stage,37] on broker 0: Shrinking ISR for partition [skynet-large-stage,37] from 1,0 to 0 (kafka.cluster.Partition)
[2017-02-21 14:52:16,544] INFO Partition [skynet-large-stage,13] on broker 0: Shrinking ISR for partition [skynet-large-stage,13] from 1,0 to 0 (kafka.cluster.Partition)
[2017-02-21 14:52:16,545] INFO Partition [__consumer_offsets,46] on broker 0: Shrinking ISR for partition [__consumer_offsets,46] from 3,2,0 to 3,0 (kafka.cluster.Partition)
...
I'd like to know what would cause this issue and why the misbehaving broker was not kicked out of the ISR.
Kafka version is 0.10.1.0
There was a bug along these lines, KAFKA-4477, that has since been fixed, but in general I've seen this same problem when Kafka brokers time out talking to a ZooKeeper node (the default session timeout is 6000ms) due to some transient network blip. At that point they get kicked out of the cluster, partition leadership changes, clients have to rebalance, and so on. For high-volume clusters, it's a pain.
Simply increasing this timeout has helped me several times before:
zookeeper.session.timeout.ms
The default value according to the official docs is 6000ms. I found that simply increasing it to 15000ms made the cluster rock solid.
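As a minimal sketch, this is what the change would look like in each broker's server.properties (the standard broker config file; the exact path depends on your installation), followed by a rolling restart of the brokers so the new value takes effect:

    # server.properties on every broker in the cluster
    # Allow more slack before a missed ZooKeeper heartbeat causes the broker
    # to be considered dead and dropped from the cluster.
    zookeeper.session.timeout.ms=15000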
Documentation for 0.11.0 Kafka version: https://kafka.apache.org/0110/documentation.html