Application description:
Per the above sections, here's the tuning for my fiber HTTP client (of which, of course, I'm using a single instance); a short usage sketch follows the snippet:
import co.paralleluniverse.fibers.httpclient.FiberHttpClientBuilder;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.impl.NoConnectionReuseStrategy;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.nio.conn.PoolingNHttpClientConnectionManager;
import org.apache.http.impl.nio.reactor.DefaultConnectingIOReactor;
import org.apache.http.impl.nio.reactor.IOReactorConfig;

// Shared connection manager backed by a 16-thread I/O reactor
// (note: DefaultConnectingIOReactor's constructor throws IOReactorException).
PoolingNHttpClientConnectionManager connectionManager =
        new PoolingNHttpClientConnectionManager(
                new DefaultConnectingIOReactor(IOReactorConfig.custom()
                        .setIoThreadCount(16)
                        .setSoKeepAlive(false)
                        .setSoLinger(0)
                        .setSoReuseAddress(false)
                        .setSelectInterval(10)
                        .build()));
connectionManager.setDefaultMaxPerRoute(32768);
connectionManager.setMaxTotal(131072);

// Single shared client; connections are deliberately not reused.
CloseableHttpClient client = FiberHttpClientBuilder.create()
        .setDefaultRequestConfig(RequestConfig.custom()
                .setSocketTimeout(1500)
                .setConnectTimeout(1000)
                .build())
        .setConnectionReuseStrategy(NoConnectionReuseStrategy.INSTANCE)
        .setConnectionManager(connectionManager)
        .build();
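For completeness, a minimal usage sketch of that single client instance (this assumes Quasar's Fiber/SuspendableRunnable API with agent instrumentation in place; the URL is just a placeholder):

import co.paralleluniverse.fibers.Fiber;
import co.paralleluniverse.strands.SuspendableRunnable;
import org.apache.http.HttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.util.EntityUtils;
import java.io.IOException;

new Fiber<Void>((SuspendableRunnable) () -> {
    try {
        // Blocking-style call that parks the fiber, not a kernel thread.
        HttpResponse response = client.execute(new HttpGet("http://example.com/"));
        EntityUtils.consume(response.getEntity());
    } catch (IOException e) {
        // handle/log the failure
    }
}).start();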
ulimits for open files are set very high (131072 for both the soft and hard limits).
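For reference, that corresponds to something like the following in /etc/security/limits.conf (the `appuser` name is a placeholder for whichever account runs the JVM):

appuser soft nofile 131072
appuser hard nofile 131072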
kernel.printk = 8 4 1 7
kernel.printk_ratelimit_burst = 10
kernel.printk_ratelimit = 5
net.ipv4.ip_local_port_range = 8192 65535
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.core.rmem_default = 16777216
net.core.wmem_default = 16777216
net.core.optmem_max = 40960
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.core.netdev_max_backlog = 100000
net.ipv4.tcp_max_syn_backlog = 100000
net.ipv4.tcp_max_tw_buckets = 2000000
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_fin_timeout = 10
net.ipv4.tcp_slow_start_after_idle = 0
net.ipv4.tcp_sack = 0
net.ipv4.tcp_timestamps = 1
Problem description
The pending stat climbs to a sky-rocketing 30K pending connection requests as well. lsof-ing the java process, I can see it has tens of thousands of file descriptors, almost all of them in CLOSE_WAIT (which makes sense, as the I/O reactor threads die/stop functioning and never get around to actually closing them).

Questions
Forgot to answer this, but I figured out what was going on roughly a week after posting the question:
There was some sort of misconfiguration that caused the I/O reactor to spawn with only two threads.
Even after providing more reactor threads, the issue persisted. It turned out that our outgoing requests were mostly SSL. Apache's SSL connection handling delegates the core work to the JVM's SSL facilities, which are simply not efficient enough to handle thousands of SSL connection requests per second. More specifically, some methods inside SSLEngine (if I recall correctly) are synchronized. Taking thread dumps under high load showed the IOReactor threads blocking each other while trying to open SSL connections.
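As a side note, the same contention can be spotted without jstack; here's a minimal sketch using the standard ThreadMXBean API (filtering on "I/O dispatcher", HttpAsyncClient's conventional reactor thread naming, is an assumption on my part):

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;

public final class ReactorBlockageProbe {
    public static void main(String[] args) {
        // Dump every live thread, including monitor/lock ownership details.
        ThreadInfo[] threads = ManagementFactory.getThreadMXBean().dumpAllThreads(true, true);
        for (ThreadInfo t : threads) {
            // Reactor workers blocked on a monitor point at the lock's owner.
            if (t.getThreadName().contains("I/O dispatcher")
                    && t.getThreadState() == Thread.State.BLOCKED) {
                System.out.printf("%s blocked on %s held by %s%n",
                        t.getThreadName(), t.getLockName(), t.getLockOwnerName());
            }
        }
    }
}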
Even trying to create a pressure-release valve in the form of a connection lease timeout didn't work; the backlogs it created were too large, rendering the application useless.
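The lease timeout referred to above maps to RequestConfig's connection-request timeout; roughly, the attempt looked like this (the 200 ms figure is illustrative, not the value we actually tried):

import org.apache.http.client.config.RequestConfig;

RequestConfig leaseLimited = RequestConfig.custom()
        .setSocketTimeout(1500)
        .setConnectTimeout(1000)
        // Give up if no pooled connection can be leased within 200 ms,
        // instead of letting requests queue up behind the exhausted pool.
        .setConnectionRequestTimeout(200)
        .build();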
Offloading the handling of outgoing SSL requests to nginx performed even worse: because the remote endpoints terminate the requests preemptively, the SSL client session cache could not be used (and the same goes for the JVM's implementation).
I wound up putting a semaphore in front of the entire module, limiting the whole thing to ~6000 concurrent requests at any given moment, which solved the issue.
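A minimal sketch of that valve, using a plain java.util.concurrent.Semaphore for illustration (the class and field names are mine; in the fiber-based module a fiber-aware semaphore would be the natural choice):

import java.util.concurrent.Callable;
import java.util.concurrent.Semaphore;

public final class OutboundGate {
    // Empirically tuned cap on concurrent outgoing requests.
    private static final Semaphore PERMITS = new Semaphore(6000);

    // Runs the request only once a permit is available, capping concurrency.
    public static <T> T throttled(Callable<T> request) throws Exception {
        PERMITS.acquire();          // blocks callers once ~6000 requests are in flight
        try {
            return request.call();
        } finally {
            PERMITS.release();      // always return the permit
        }
    }
}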