
Kafka broker constantly shrinking and expanding ISR?

Tags:

apache-kafka

We have a 4-node cluster in production. One of the nodes got into a state where it constantly shrank and expanded the ISR for more than an hour, and it could not recover until the broker was bounced.

[2017-02-21 14:52:16,518] INFO Partition [skynet-large-stage,5] on broker 0: Shrinking ISR for partition [skynet-large-stage,5] from 2,0 to 0 (kafka.cluster.Partition)
[2017-02-21 14:52:16,543] INFO Partition [skynet-large-stage,37] on broker 0: Shrinking ISR for partition [skynet-large-stage,37] from 1,0 to 0 (kafka.cluster.Partition)
[2017-02-21 14:52:16,544] INFO Partition [skynet-large-stage,13] on broker 0: Shrinking ISR for partition [skynet-large-stage,13] from 1,0 to 0 (kafka.cluster.Partition)
[2017-02-21 14:52:16,545] INFO Partition [__consumer_offsets,46] on broker 0: Shrinking ISR for partition [__consumer_offsets,46] from 3,2,0 to 3,0 (kafka.cluster.Partition)
.
.

I'd like to know what could cause this issue and why the broken broker was not kicked out of the ISR.

Kafka version is 0.10.1.0

Baby.zhou asked Feb 21 '17 10:02


1 Answer

There was a bug, KAFKA-4477, that has since been fixed. In general, though, I've seen this same problem when Kafka brokers time out while talking to a ZooKeeper node (the default session timeout is 6000 ms) because of some transient network blip. At that point they get kicked out of the cluster, partition leadership changes, clients have to rebalance, and so on. For high-volume clusters, it's a pain.

Simply increasing this timeout has helped me several times before:

 zookeeper.session.timeout.ms

According to the official docs, the default value is 6000 ms. I found that simply increasing it to 15000 ms made the cluster rock solid.
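
As a rough sketch of what that change looks like, assuming the setting lives in each broker's `server.properties` (the file name and location depend on your installation), and that the brokers are restarted afterwards so the new value is picked up:

```properties
# server.properties (set on every broker, then restart the broker)
# Raise the ZooKeeper session timeout from the 6000 ms default to 15000 ms,
# the value that worked for me; tune it for your own network.
zookeeper.session.timeout.ms=15000
```

The trade-off is that a broker that has really died takes longer to be detected, so don't raise the value further than you need to.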

Documentation for Kafka 0.11.0: https://kafka.apache.org/0110/documentation.html

mjuarez answered Sep 20 '22 07:09