What's the exact reason to have heartbeat failure for group because it's rebalancing ? What's the reason for rebalance where all the consumers in group are up ?
Thank you.
Heartbeats are the basic mechanism to check if all consumers are still up and running. If you get a heartbeat failure because the group is rebalancing, it indicates that your consumer instance took too long to send the next heartbeat and was considered dead and thus a rebalance got triggered.
If you want to prevent this from happening, you can either increase the timeout (session.timeout.ms
), or make sure your consumer sends heartbeat more often (heartbeat.interval.ms
). Heartbeats are basically embedded in poll()
, thus, you need to make sure you call poll frequently enough. This can usually be achieved by limit the number of records a single poll returns via max.poll.records
(to shorten the time it takes to process all data that got fetched).
Update
Since Kafka 0.10.1, heartbeats are sent in a background thread, and not when poll()
is called (cf. https://cwiki.apache.org/confluence/display/KAFKA/KIP-62%3A+Allow+consumer+to+send+heartbeats+from+a+background+thread). In this new design, configuration session.timeout.ms
and heartbeat.interval.ms
are still the same. Additionally, there is max.poll.interval.ms
that determines how often poll()
must be called. If you miss to call poll()
within max.poll.interval.ms
, the heartbeat thread assume that the processing thread died, and will send a leave-group-request that will trigger a rebalance, and the heartbeat thread will stop sending heartbeats afterwards. If you processing thread is ok but just slow, the next call to poll()
will initiate another rebalance to re-join the group again.
For more details, cf. Difference between session.timeout.ms and max.poll.interval.ms for Kafka >= 0.10.1
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With