We have a 3-node Kafka cluster deployment, with a total of 35 topics with 50 partitions each. We have configured a replication factor of 2.
We are seeing a very strange problem: intermittently, a Kafka node stops responding with the error:
ERROR Error while accepting connection (kafka.network.Acceptor)
java.io.IOException: Too many open files
at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422)
at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250)
at kafka.network.Acceptor.accept(SocketServer.scala:460)
at kafka.network.Acceptor.run(SocketServer.scala:403)
at java.lang.Thread.run(Thread.java:745)
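For reference, this is the generic JVM symptom of file-descriptor exhaustion: every accepted client socket and every open log segment or index file costs the broker one descriptor, and once the per-process limit is reached, accept() and open() fail with EMFILE, which the JVM surfaces as an IOException saying "Too many open files". A deliberately destructive toy reproduction (do not run this on a production host; the path and class name are only illustrative):

import java.io.FileInputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class FdExhaustionDemo {
    public static void main(String[] args) {
        List<FileInputStream> streams = new ArrayList<>();
        try {
            // Keep opening descriptors without closing them until the
            // per-process limit is hit; the JVM then reports EMFILE as
            // an IOException: "Too many open files".
            while (true) {
                streams.add(new FileInputStream("/dev/null"));
            }
        } catch (IOException e) {
            System.out.println("Failed after " + streams.size() + " descriptors: " + e.getMessage());
        } finally {
            for (FileInputStream in : streams) {
                try { in.close(); } catch (IOException ignored) { }
            }
        }
    }
}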
We have deployed the latest Kafka version and are using spring-kafka as the client:
kafka_2.12-2.1.0 (CentOS Linux release 7.6.1810 (Core))
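When a node stops accepting new connections, it can help to probe the cluster from a separate JVM with the plain Kafka AdminClient (which ships in kafka-clients, the library spring-kafka builds on) to see which brokers still answer metadata requests. A minimal sketch; the bootstrap addresses, timeout, and class name are placeholders:

import java.util.Collection;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.common.Node;

public class ClusterProbe {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Placeholder addresses: list all three brokers here.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092,broker2:9092,broker3:9092");
        props.put(AdminClientConfig.REQUEST_TIMEOUT_MS_CONFIG, "5000");

        try (AdminClient admin = AdminClient.create(props)) {
            // nodes() lists the brokers currently known to the cluster metadata.
            Collection<Node> nodes = admin.describeCluster().nodes().get();
            for (Node node : nodes) {
                System.out.printf("Broker %d reachable at %s:%d%n", node.id(), node.host(), node.port());
            }
        }
    }
}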
When we run lsof -p <kafka_pid>|wc -l, we get the total number of open descriptors as only around 7,000. When we run lsof|grep kafka|wc -l, we get around 1.5 million open FDs, and we have checked that they all belong to the Kafka process. As soon as we restart the Kafka process, lsof|grep kafka|wc -l comes back to around 7,000.
We have tried setting the file limits very high, but we still hit this issue. The following limits are set for the Kafka process:
cat /proc/<kafka_pid>/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            8388608              unlimited            bytes
Max core file size        0                    unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             513395               513395               processes
Max open files            500000               500000               files
Max locked memory         65536                65536                bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       513395               513395               signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us
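As a cross-check of the ~7,000 descriptors reported by lsof -p and the 500,000 "Max open files" limit above, the JDK's UnixOperatingSystemMXBean exposes the current and maximum descriptor counts of the JVM it runs in; to reflect the broker's numbers it would have to run inside (or be attached to) the broker process. A minimal sketch, with an illustrative class name:

import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;
import com.sun.management.UnixOperatingSystemMXBean;

public class FdUsage {
    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof UnixOperatingSystemMXBean) {
            UnixOperatingSystemMXBean unix = (UnixOperatingSystemMXBean) os;
            // Descriptors currently held by this process vs. the soft limit,
            // which should match "Max open files" in /proc/<pid>/limits.
            System.out.printf("Open FDs: %d of %d%n",
                    unix.getOpenFileDescriptorCount(),
                    unix.getMaxFileDescriptorCount());
        } else {
            System.out.println("File-descriptor counts are not exposed on this platform.");
        }
    }
}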
We have a few questions here: Is there any difference in the behavior of lsof and lsof -p between CentOS 6 and CentOS 7?
Edit 1: It seems we are hitting this Kafka issue: https://issues.apache.org/jira/browse/KAFKA-7697
We plan to downgrade the Kafka version to 2.0.1.
Based on the asker's earlier update, they found they were hitting https://issues.apache.org/jira/browse/KAFKA-7697
A quick check now shows that it is resolved, and based on the JIRA it seems the fix for this problem is to use Kafka 2.1.1 or above.
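On the side question about lsof versus lsof -p: one plausible explanation for the gap (an assumption, not a verified behavior of the lsof build on these hosts) is that a bare lsof can emit one row per task/thread for each descriptor, which multiplies the count for a heavily threaded process like a Kafka broker, while lsof -p <pid> and /proc/<pid>/fd count each descriptor once per process. Counting the entries under /proc/<pid>/fd is a thread-independent cross-check; a minimal sketch, with an illustrative class name:

import java.io.File;

public class ProcFdCount {
    public static void main(String[] args) {
        // Pass the broker PID as the first argument; defaults to this JVM.
        String pid = args.length > 0 ? args[0] : "self";
        File fdDir = new File("/proc/" + pid + "/fd");
        String[] entries = fdDir.list(); // one symlink per open descriptor
        if (entries == null) {
            System.out.println("Cannot read " + fdDir + " (bad PID or insufficient permissions).");
        } else {
            System.out.println("Open descriptors for PID " + pid + ": " + entries.length);
        }
    }
}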