We are getting random NetworkExceptions and TimeoutExceptions in our production environment:
Brokers: 3
Zookeepers: 3
Servers: 3
Kafka: 0.10.0.1
Zookeeper: 3.4.3
We are occasionally getting this exception in our producer logs:
Expiring 10 record(s) for TOPIC:XXXXXX: 5608 ms has passed since batch creation plus linger time.
The number of milliseconds in these error messages keeps changing; sometimes it's ~5 seconds, other times it's up to ~13 seconds!
And very rarely we get:
NetworkException: Server disconnected before response received.
The cluster consists of 3 brokers and 3 zookeepers. The producer server and the Kafka cluster are on the same network.
I am making synchronous calls. There is a web service that multiple user requests call to send their data, and this Kafka web service has a single Producer object that does all the sending. The producer's request timeout was initially 1000 ms and has since been changed to 15000 ms (15 seconds). Even after increasing the timeout, TimeoutExceptions still show up in the error logs.
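For context, here is a minimal sketch of roughly how the sending path looks; the broker addresses, topic handling, and serializers are placeholders rather than our real values:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class SyncSender {
    // One shared producer instance serving all incoming web-service requests.
    private static final KafkaProducer<String, String> PRODUCER = createProducer();

    private static KafkaProducer<String, String> createProducer() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092,broker2:9092,broker3:9092"); // placeholder hosts
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("request.timeout.ms", "15000"); // raised from the initial 1000 ms
        return new KafkaProducer<>(props);
    }

    // Synchronous send: blocking on get() surfaces TimeoutException/NetworkException
    // to the calling request, which matches the behaviour described above.
    public static RecordMetadata send(String topic, String key, String value) throws Exception {
        return PRODUCER.send(new ProducerRecord<>(topic, key, value)).get();
    }
}
```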
What can be the reason?
To handle the timeout exceptions, the general practice is to rule out broker-side issues first: make sure that the topic partitions are fully replicated and that the brokers are not overloaded, and fix any host name resolution or network connectivity issues.
You can deal with transient send failures in several ways: drop the failed messages, exert backpressure further up the application and retry the sends, or send all messages to alternative local storage from which they will be ingested into Kafka asynchronously (see the sketch below).
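As a rough illustration of the last option combined with a retriable/non-retriable split, here is a sketch; appendToLocalStore() is a hypothetical helper standing in for "write to alternative local storage", not part of the Kafka API:

```java
import org.apache.kafka.clients.producer.Callback;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.errors.RetriableException;

public class FailureHandling {

    // Send asynchronously and decide in the callback what to do on failure.
    public static void sendWithFallback(Producer<String, String> producer,
                                        final String topic, final String key, final String value) {
        producer.send(new ProducerRecord<>(topic, key, value), new Callback() {
            @Override
            public void onCompletion(RecordMetadata metadata, Exception exception) {
                if (exception == null) {
                    return; // delivered successfully
                }
                if (exception instanceof RetriableException) {
                    // Transient failure: retry, apply backpressure, or stash locally.
                    appendToLocalStore(topic, key, value);
                } else {
                    // Non-retriable failure: log and drop (or alert).
                    System.err.println("Dropping record for " + topic + ": " + exception);
                }
            }
        });
    }

    private static void appendToLocalStore(String topic, String key, String value) {
        // Hypothetical: persist the record locally (file, embedded DB, ...) so a
        // separate process can re-ingest it into Kafka later.
    }
}
```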
It is a bit tricky to find the root cause; I'll share my experience, which hopefully someone will find useful. In general it can be a network issue or excessive network load in combination with acks=all. Here is a diagram from Kafka KIP-91 that explains the TimeoutException at the time of writing (still applicable up to 1.1.0):
Excluding network configuration issues or errors, these are the properties you can adjust, depending on your scenario, to mitigate or solve the problem:
The buffer.memory setting controls the total memory available to the producer for buffering. If records are produced faster than they can be transmitted to Kafka, this buffer will be exhausted; additional send calls then block for up to max.block.ms, after which the producer throws a TimeoutException.
max.block.ms already has a high value and I do not suggest increasing it further. buffer.memory defaults to 32 MB; depending on your message size you may want to increase it, and if necessary increase the JVM heap space as well.
retries defines how many attempts are made to resend the record on error before giving up. If you are using zero retries, you can try to mitigate the problem by increasing this value, but beware that record order is no longer guaranteed unless you set max.in.flight.requests.per.connection to 1.
Records are sent as soon as the batch size is reached or the linger time has passed, whichever comes first. If batch.size (default 16 KB) is smaller than the maximum request size, you should perhaps use a higher value. In addition, raise linger.ms to a higher value such as 10, 50 or 100 to make better use of batching; this puts less load on the network and improves compression if you are using it.
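To make this concrete, here is a sketch that gathers the properties discussed above into one producer configuration; the specific values are illustrative starting points (and the broker addresses are placeholders), not recommendations:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;

public class TunedProducerConfig {
    public static KafkaProducer<String, String> create() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092,broker2:9092,broker3:9092"); // placeholder hosts
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        props.put("acks", "all");                  // as in the scenario above; heavier on the network
        props.put("request.timeout.ms", "15000");
        props.put("max.block.ms", "60000");        // default; send() blocks this long once buffer.memory is full
        props.put("buffer.memory", String.valueOf(64L * 1024 * 1024)); // example bump from the 32 MB default
        props.put("retries", "3");                 // > 0 to ride out transient errors
        props.put("max.in.flight.requests.per.connection", "1");       // keeps ordering with retries enabled
        props.put("batch.size", String.valueOf(64 * 1024));            // larger than the 16 KB default
        props.put("linger.ms", "50");              // give batches time to fill; helps compression too
        return new KafkaProducer<>(props);
    }
}
```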
There is no exact answer to this kind of issue, since it also depends on the implementation; in my case, experimenting with the values above helped.