Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Asynchronous auto-commit of offsets fails

I have a question on Kafka auto-commit mechanism. I'm using Spring-Kafka with auto-commit enabled. As an experiment, I disconnected my consumer's connection to Kafka for 30 seconds while the system was idle (no new messages in the topic, no messages being processed). After reconnecting I got a few messages like so:

Asynchronous auto-commit of offsets {cs-1915-2553221872080030-0=OffsetAndMetadata{offset=19, leaderEpoch=0, metadata=''}} failed: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.

First, I don't understand what is there to commit? The system was idle (all previous messages were already committed). Second, the disconnection time was 30 seconds, much less than the 5 minutes (300000 ms) max.poll.interval.ms Third, in an uncontrolled failure of Kafka I got at least 30K messages of this type, which was resolved by restarting the process. Why is this happening?

I'm listing here my consumer configuration:

allow.auto.create.topics = true
        auto.commit.interval.ms = 100
        auto.offset.reset = latest
        bootstrap.servers = [kafka1-eu.dev.com:9094, kafka2-eu.dev.com:9094, kafka3-eu.dev.com:9094]
        check.crcs = true
        client.dns.lookup = default
        client.id =
        client.rack =
        connections.max.idle.ms = 540000
        default.api.timeout.ms = 60000
        enable.auto.commit = true
        exclude.internal.topics = true
        fetch.max.bytes = 52428800
        fetch.max.wait.ms = 500
        fetch.min.bytes = 1
        group.id = feature-cs-1915-2553221872080030
        group.instance.id = null
        heartbeat.interval.ms = 3000
        interceptor.classes = []
        internal.leave.group.on.close = true
        isolation.level = read_uncommitted
        key.deserializer = class org.apache.kafka.common.serialization.StringDeserializer
        max.partition.fetch.bytes = 1048576
        max.poll.interval.ms = 300000
        max.poll.records = 500
        metadata.max.age.ms = 300000
        metric.reporters = []
        metrics.num.samples = 2
        metrics.recording.level = INFO
        metrics.sample.window.ms = 30000
        partition.assignment.strategy = [class org.apache.kafka.clients.consumer.RangeAssignor]
        receive.buffer.bytes = 65536
        reconnect.backoff.max.ms = 1000
        reconnect.backoff.ms = 50
        request.timeout.ms = 30000
        retry.backoff.ms = 100
        sasl.client.callback.handler.class = null
        sasl.jaas.config = null
        sasl.kerberos.kinit.cmd = /usr/bin/kinit
        sasl.kerberos.min.time.before.relogin = 60000
        sasl.kerberos.service.name = null
        sasl.kerberos.ticket.renew.jitter = 0.05
        sasl.kerberos.ticket.renew.window.factor = 0.8
        sasl.login.callback.handler.class = null
        sasl.login.class = null
        sasl.login.refresh.buffer.seconds = 300
        sasl.login.refresh.min.period.seconds = 60
        sasl.login.refresh.window.factor = 0.8
        sasl.login.refresh.window.jitter = 0.05
        sasl.mechanism = GSSAPI
        security.protocol = SSL
        send.buffer.bytes = 131072
        session.timeout.ms = 15000
        ssl.cipher.suites = null
        ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1]
        ssl.endpoint.identification.algorithm = https
        ssl.key.password = [hidden]
        ssl.keymanager.algorithm = SunX509
        ssl.keystore.location = /home/me/feature-2553221872080030.keystore
        ssl.keystore.password = [hidden]
        ssl.keystore.type = JKS
        ssl.protocol = TLS
        ssl.provider = null
        ssl.secure.random.implementation = null
        ssl.trustmanager.algorithm = PKIX
        ssl.truststore.location = /home/me/feature-2553221872080030.truststore
        ssl.truststore.password = [hidden]
        ssl.truststore.type = JKS
        value.deserializer = class org.springframework.kafka.support.serializer.ErrorHandlingDeserializer2
like image 603
Marumba Avatar asked Oct 24 '25 09:10

Marumba


1 Answers

First, I don't understand what is there to commit?

You are right, there is nothing new to commit if no new data is flowing. However, having auto.commit enabled and your consumer is still running (even without being able to connect to broker) the poll method is still responsible of the following steps:

  • Fetch messages from assigned partitions
  • Trigger partition assignment (if necessary)
  • Commit offsets if auto offset commit is enabled

Together with your interval of 100ms (see auto.commit.intervals) the consumer still tries to asynchronously commit the (non changing) offset position of the consumer.

Second, the disconnection time was 30 seconds, much less than the 5 minutes (300000 ms) max.poll.interval.ms

It is not the max.poll.interval that is causing the rebalance but rather the combination of your heartbeat.interval.ms setting and the session.timeout.ms. Your consumer sends in a background thread heartbeats based on the interval setting, in your case 3 seconds. If no heartbeats are received by the broker before the expiration of this session timeout (in your case 15 seconds), then the broker will remove this client from the group and initiate a rebalance.

A more detailed description of the configuration I mentioned are given in the Kafka documentation on Consumer Configs

Third, in an uncontrolled failure of Kafka I got at least 30K messages of this type, which was resolved by restarting the process. Why is this happening?

That seems to be a combination of the first two questions, where heartbeats cannot be sent and still the consumer is trying to commit through the contiuously called poll method.

As @GaryRussell mentioned in his comment, I would be careful to use auto.commit.enabled and rather take the control over the Offset Management to yourself.

like image 88
Michael Heil Avatar answered Oct 26 '25 03:10

Michael Heil