
Spring Kafka producers throwing TimeoutExceptions

Problem

I have a Kafka setup with three brokers in Kubernetes, set up according to the guide at https://github.com/Yolean/kubernetes-kafka. The following error message appears when producing messages from a Java client.

2018-06-06 11:15:44.103 ERROR 1 --- [ad | producer-1] o.s.k.support.LoggingProducerListener    : Exception thrown when sending a message with key='null' and payload='[...redacted...]':
org.apache.kafka.common.errors.TimeoutException: Expiring 1 record(s) for topicname-0: 30001 ms has passed since last append

Detailed setup

The listeners are set up to allow SSL producers/consumers from the outside world:

advertised.host.name = null
advertised.listeners = OUTSIDE://kafka-0.mydomain.com:32400,PLAINTEXT://:9092
advertised.port = null
listener.security.protocol.map = PLAINTEXT:PLAINTEXT,SSL:SSL,SASL_PLAINTEXT:SASL_PLAINTEXT,SASL_SSL:SASL_SSL,OUTSIDE:SSL
listeners = OUTSIDE://:9094,PLAINTEXT://:9092
inter.broker.listener.name = PLAINTEXT
host.name =
port.name = 9092

The OUTSIDE listeners are listening on kafka-0.mydomain.com, kafka-1.mydomain.com, etc. The plaintext listeners are listening on any IP, since they are cluster-local to Kubernetes.

The producer settings:

kafka:
  bootstrap-servers: kafka.mydomain.com:9092
  properties:
    security.protocol: SSL
  producer:
    batch-size: 16384
    buffer-memory: 1048576 # 1MB
    retries: 1
    ssl:
      key-password: redacted
      keystore-location: file:/var/private/ssl/kafka.client.keystore.jks
      keystore-password: redacted
      truststore-location: file:/var/private/ssl/kafka.client.truststore.jks
      truststore-password: redacted

In addition I set linger.ms to 100 in code, which forces messages to be transmitted within 100ms. Linger time is set intentionally low, because the use case requires minimal delays.

Analysis

  • The errors started appearing when the broker was moved to SSL.
  • On the server side everything is running as expected: there are no errors in the log and I can connect to the broker manually with a Kafka client tool.
  • The errors appear intermittently: sometimes it sends 30+ messages per second, sometimes it sends nothing at all. It may work like a charm for hours and then just spam timeouts for a little while.
  • Clocks for the client and server are in sync (UTC).
  • CPU is consistently around 20% for both the producing and server side.

What could it be?

Jodiug asked Jun 06 '18 16:06


1 Answer

This problem normally occurs when the producer is faster than the brokers. In your setup the likely reason is that SSL needs extra CPU, which may slow down the brokers. In any case, check the following:

  • Check whether you are producing messages at a steady rate; from what you describe, you seem to be having spikes.
  • Another possibility is that other Kafka clients in the cluster (producers or consumers), not necessarily using the same topic, are overloading the brokers (check the brokers' CPU and network).

To minimize the impact of whatever is holding records back, you should increase buffer-memory to more than 32MB; bear in mind that 32MB is the default and you are setting it lower. The smaller the buffer, the more easily it fills up. When that happens, sending blocks for at most max.block.ms, and a request times out after request.timeout.ms.
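
As a rough illustration of why the 1 MB buffer from the question fills up quickly, the sketch below counts how many full-size batches fit in the buffer. It assumes the producer hands out buffer memory in chunks of batch.size, which is a simplification. The class and method names are made up for this example, and 64 MB is only an illustrative pick for "more than 32MB"; buffer.memory and batch.size are the underlying Kafka producer property names.

```java
import java.util.Properties;

public class BufferSizing {
    // Rough upper bound on how many full-size batches fit in the producer buffer.
    static long batchesThatFit(long bufferMemoryBytes, int batchSizeBytes) {
        return bufferMemoryBytes / batchSizeBytes;
    }

    // Producer properties with a larger buffer; 64 MB is an illustrative value only.
    static Properties largerBufferProps() {
        Properties props = new Properties();
        props.setProperty("buffer.memory", String.valueOf(64L * 1024 * 1024));
        props.setProperty("batch.size", "16384");
        return props;
    }

    public static void main(String[] args) {
        // The question's 1 MB buffer with 16 KB batches.
        System.out.println(batchesThatFit(1_048_576L, 16_384));  // prints 64
        // The 32 MB default with the same batch size.
        System.out.println(batchesThatFit(33_554_432L, 16_384)); // prints 2048
        System.out.println(largerBufferProps());
    }
}
```

With only 64 batches of headroom, a short broker stall during a spike is enough to fill the buffer, which matches the intermittent pattern described in the question.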

Another parameter that you should increase is batch-size; this parameter is in bytes, not in number of messages. linger.ms should also be increased, but if this producer's messages are created while handling user requests, do not increase it by much; a good choice could be 1-4 ms.

Messages are sent when the batch reaches batch.size, or when linger.ms elapses before enough data has accumulated to fill it. Bigger batches normally increase throughput, but if the linger is too low they don't help, because the batch is sent before there is enough data to reach batch.size.
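
The rule above can be sketched as a small predicate. This is a simplified model for illustration, not the producer's actual code: the real client has further triggers (an explicit flush, for instance), and the class, method and parameter names here are hypothetical.

```java
public class BatchSendRule {
    // Simplified model: a batch is ready once it is full or has waited linger.ms.
    static boolean readyToSend(int bytesInBatch, int batchSizeBytes,
                               long waitedMs, long lingerMs) {
        boolean full = bytesInBatch >= batchSizeBytes;
        boolean lingerElapsed = waitedMs >= lingerMs;
        return full || lingerElapsed;
    }

    public static void main(String[] args) {
        // Full batch: sent immediately, regardless of linger.
        System.out.println(readyToSend(16_384, 16_384, 0, 100));  // prints true
        // Partly filled batch whose linger time has elapsed: sent as is.
        System.out.println(readyToSend(2_000, 16_384, 100, 100)); // prints true
        // Partly filled batch still inside the linger window: keeps waiting.
        System.out.println(readyToSend(2_000, 16_384, 10, 100));  // prints false
    }
}
```

The second case is the one the paragraph warns about: with a low linger, most batches leave partly filled, so a larger batch.size alone changes little.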

Also recheck in the producer logs that the properties are loaded correctly.
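
Besides reading the logs, the same check can be done in code by comparing the values you intended to set with the configuration map the producer was actually built from. The helper below is a hypothetical sketch using plain maps; how you obtain the effective map depends on your setup and is not shown. The "effective" values in main are invented to demonstrate a missing linger.ms.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ConfigCheck {
    // Returns one line per expected key whose effective value is missing or different.
    static List<String> mismatches(Map<String, Object> expected,
                                   Map<String, Object> effective) {
        List<String> problems = new ArrayList<>();
        for (Map.Entry<String, Object> e : expected.entrySet()) {
            Object actual = effective.get(e.getKey());
            // Compare as strings so that 100 and "100" count as equal.
            if (actual == null
                    || !String.valueOf(actual).equals(String.valueOf(e.getValue()))) {
                problems.add(e.getKey() + ": expected " + e.getValue()
                        + ", found " + actual);
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        // Settings taken from the question.
        Map<String, Object> expected = new LinkedHashMap<>();
        expected.put("linger.ms", 100);
        expected.put("buffer.memory", 1048576);
        expected.put("security.protocol", "SSL");

        // Invented effective config in which linger.ms was never applied.
        Map<String, Object> effective = new LinkedHashMap<>();
        effective.put("buffer.memory", "1048576");
        effective.put("security.protocol", "SSL");

        // prints "linger.ms: expected 100, found null"
        mismatches(expected, effective).forEach(System.out::println);
    }
}
```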

padilo answered Nov 01 '22 00:11