I am work on a high load tcp application with Java Netty, which expect to arrive 300k concurrent TCP connections.
It works perfect on test server, arrive 300k connections, but when deploy to production server, it only can support 65387 connections, after arrive this number, client will throw out a "java.io.IOException: Connection reset by peer" exceptions. I try many times, every time, when connections up to 65387, client will can't create connection.
The network capture as bellow, 10.95.196.27 is server, 10.95.196.29 is client :
16822 12:26:12.480238 10.95.196.29 10.95.196.27 TCP 74 can-ferret > http [SYN] Seq=0 Win=14600 Len=0 MSS=1460 SACK_PERM=1 TSval=872641174 TSecr=0 WS=128
16823 12:26:12.480267 10.95.196.27 10.95.196.29 TCP 66 http > can-ferret [SYN, ACK] Seq=0 Ack=1 Win=2920 Len=0 MSS=1460 SACK_PERM=1 WS=1024
16824 12:26:12.480414 10.95.196.29 10.95.196.27 TCP 60 can-ferret > http [ACK] Seq=1 Ack=1 Win=14720 Len=0
16825 12:26:12.480612 10.95.196.27 10.95.196.29 TCP 54 http > can-ferret [FIN, ACK] Seq=1 Ack=1 Win=3072 Len=0
16826 12:26:12.480675 10.95.196.29 10.95.196.27 HTTP 94 Continuation or non-HTTP traffic
16827 12:26:12.480697 10.95.196.27 10.95.196.29 TCP 54 http > can-ferret [RST] Seq=1 Win=0 Len=0
The exception cause by after client 3 handshake to server, server send a RST package to client, and the new connection was broken.
client side exception stack as bellow:
16:42:05.826 [nioEventLoopGroup-1-15] WARN i.n.channel.DefaultChannelPipeline - An exceptionCaught() event was fired, and it reached at the end of the pipeline. It usually means the last handler in the pipeline did not handle the exception.
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[na:1.7.0_25]
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) ~[na:1.7.0_25]
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:225) ~[na:1.7.0_25]
at sun.nio.ch.IOUtil.read(IOUtil.java:193) ~[na:1.7.0_25]
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:375) ~[na:1.7.0_25]
at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:259) ~[netty-all-4.0.0.Beta3.jar:na]
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:885) ~[netty-all-4.0.0.Beta3.jar:na]
at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:226) ~[netty-all-4.0.0.Beta3.jar:na]
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:72) ~[netty-all-4.0.0.Beta3.jar:na]
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:460) ~[netty-all-4.0.0.Beta3.jar:na]
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:424) ~[netty-all-4.0.0.Beta3.jar:na]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:360) ~[netty-all-4.0.0.Beta3.jar:na]
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:103) ~[netty-all-4.0.0.Beta3.jar:na]
at java.lang.Thread.run(Thread.java:724) ~[na:1.7.0_25]
Sever side have not exceptions.
I had try turning some sysctl item as bellow to support huge connections, but its useless:
net.core.wmem_max = 33554432
net.ipv4.tcp_rmem = 4096 4096 33554432
net.ipv4.tcp_wmem = 4096 4096 33554432
net.ipv4.tcp_mem = 786432 1048576 26777216
net.ipv4.tcp_max_tw_buckets = 360000
net.core.netdev_max_backlog = 4096
vm.min_free_kbytes = 65536
vm.swappiness = 0
net.ipv4.ip_local_port_range = 1024 65535
net.ipv4.tcp_max_syn_backlog = 4096
net.netfilter.nf_conntrack_max = 3000000
net.nf_conntrack_max = 3000000
net.core.somaxconn = 327680
The max open fd already set to 999999
linux-152k:~ # ulimit -n
999999
The OS release is SUSE Linux Enterprise Server 11 SP2 with 3.0.13 kernel:
linux-152k:~ # cat /etc/SuSE-release
SUSE Linux Enterprise Server 11 (x86_64)
VERSION = 11
PATCHLEVEL = 2
linux-152k:~ # uname -a
Linux linux-152k 3.0.13-0.27-default #1 SMP Wed Feb 15 13:33:49 UTC 2012 (d73692b) x86_64 x86_64 x86_64 GNU/Linux.
The dmesg have not any error information, CPU and Memory keep low level, every thing looks good, just server reset connection from client.
We have a test server which was SUSE Linux Enterprise Server 11 SP1 with 2.6.32 kernel, it works well, can support up to 300k connections.
I think maybe some kernel or security limit cause this, but I can't find it, any suggestions or any way to get some debug informations of why server send RST? Thanks.
Santal, I've just came across the following link, and it seems it can give an answer to your question: What is the theoretical maximum number of open TCP connections that a modern Linux box can have
Finally got the root cause. Simply said, it was a JDK bug, please refer to http://mail.openjdk.java.net/pipermail/nio-dev/2013-September/002284.html which cause NPE when fd > 64 * 1024.
After upgrade to JDK7_45, everything works great now.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With