Our (Linux) server used the option {active, once} with its sockets, and {tcp_error, Socket, etimedout} messages kept popping up. I know this can be caused by bad network conditions, but something about it was strange.
TCP keepalive was enabled system-wide on our machine, and the actual option values were:
net.ipv4.tcp_keepalive_time = 1200
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_intvl = 75
That means a socket should take at least 20 minutes to time out, I believe. Yet our processes received {tcp_error, Socket, etimedout} in less than 10 seconds.
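As a rough check of that estimate, here is a small Python sketch (Python rather than Erlang, since it is only arithmetic). It assumes the usual reading of the three settings: the connection sits idle for tcp_keepalive_time seconds, then tcp_keepalive_probes probes go out tcp_keepalive_intvl seconds apart before the kernel gives up. The function name is made up for illustration.

```python
# Values copied from the sysctl settings quoted above.
KEEPALIVE_TIME = 1200   # net.ipv4.tcp_keepalive_time: idle seconds before the first probe
KEEPALIVE_PROBES = 9    # net.ipv4.tcp_keepalive_probes: unanswered probes before giving up
KEEPALIVE_INTVL = 75    # net.ipv4.tcp_keepalive_intvl: seconds between probes


def keepalive_detection_seconds(idle, probes, interval):
    """Seconds until keepalive alone would declare an idle, silent peer dead.

    Assumption: every probe goes unanswered, so the full probe budget is used.
    """
    return idle + probes * interval


total = keepalive_detection_seconds(KEEPALIVE_TIME, KEEPALIVE_PROBES, KEEPALIVE_INTVL)
print(total, "seconds")
```

Under that reading the total is 1200 + 9 * 75 = 1875 seconds, so keepalive could not explain an error arriving within seconds.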
I wondered whether it could be triggered by the gen_tcp:send(...) operations. But that seemed impossible to me, because the send operations were all synchronous, so they would fail immediately.
So, my question is: where did the etimedout message come from, and what exactly triggered it? I dug around in the C source of the Erlang VM, especially inet_drv.c, but have reached no conclusion yet.
Thanks.
A tcpdump capture showed that it was the timeout event from TCP retransmissions.
Our server machine had /proc/sys/net/ipv4/tcp_retries2 set to 5, which makes the kernel drop the connection after 5 unacknowledged retransmissions. This value defaults to 15 on developer machines, so we couldn't reproduce the problem locally.
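To get a feel for how much tcp_retries2 matters, here is a simplified Python model of the retransmission backoff. It is only a sketch under stated assumptions: the retransmission timer starts at 200 ms, doubles after every attempt, and is capped at 120 seconds. Real timers are derived from the measured round-trip time, so actual figures will differ. The function name and parameters are hypothetical.

```python
def retransmission_timeout_seconds(retries, rto_min=0.2, rto_max=120.0):
    """Hypothetical time until the kernel gives up on unacknowledged data.

    Assumptions (not measured values): the timer starts at rto_min, doubles
    after each attempt, and never exceeds rto_max. The original transmission
    plus `retries` retransmissions each wait one timer interval.
    """
    total = 0.0
    rto = rto_min
    for _ in range(retries + 1):
        total += rto
        rto = min(rto * 2, rto_max)  # exponential backoff with a ceiling
    return total


for retries in (5, 15):
    print(retries, "retries:", round(retransmission_timeout_seconds(retries), 1), "seconds")
```

Under these assumptions a limit of 5 gives up after roughly a dozen seconds, while the default of 15 holds on for about fifteen minutes, which fits the gap between the server and the developer machines.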
A successful return from gen_tcp:send(...) (or the equivalent API in other languages) only means that the data has been accepted by the TCP stack. There is no guarantee that it will reach the peer, and the error may surface later, while you are blocked on some other operation.
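The same behaviour can be seen outside Erlang. The Python sketch below is an analogue, not the gen_tcp code path: it opens a loopback TCP connection to a listener that never calls accept() or reads anything, and the send still returns successfully. The function name is made up for illustration.

```python
import socket


def send_without_reader(payload=b"ping"):
    """Send on a loopback TCP connection whose peer never accepts or reads.

    Returns the number of bytes the local TCP stack accepted.
    """
    # The listener exists, but accept() and recv() are never called on it.
    server = socket.create_server(("127.0.0.1", 0))
    try:
        client = socket.create_connection(server.getsockname())
        try:
            # Returns as soon as the kernel has buffered the data.
            return client.send(payload)
        finally:
            client.close()
    finally:
        server.close()


print(send_without_reader())
```

The send reports success even though no application ever sees the data, so a delivery failure can only be reported afterwards, for example as an etimedout error on a later operation.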
I found a brief description of TCP retransmissions here.