
Very strange disconnections with TCP sockets

In short: the server calls ::send() successfully, but the data never goes out on the wire. After a few seconds the client sends its quit command because it has not received anything, and the server receives that command correctly.

In detail: the server sends a command to its clients every 1/10 of a second and a heartbeat every second. The clients only return an ack for the heartbeat. We modified the server application to log every command sent and received, and we recorded the PC's traffic with Wireshark. We can match every logged command to a TCP packet until the problem kicks in. The problem only affects one client at a time; data continues to flow normally to the other clients. The connection usually works for a few minutes before getting into trouble. It should work from the moment the client boots until it is closed (i.e. for days).
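Because a successful ::send() only means the data was handed to the local TCP stack, the heartbeat acks are the one signal the server has that data is actually reaching a client. As a minimal sketch (not the poster's code), the server could track the last ack per client and flag a connection once acks stop arriving. The class name HeartbeatMonitor and any timeout value passed to it are assumptions made for illustration.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper, not from the original post: remembers when the last
// heartbeat ack arrived from one client and reports when it is overdue.
class HeartbeatMonitor {
public:
    using Clock = std::chrono::steady_clock;

    // `start` is when the connection was accepted; `timeout` is how long
    // the server tolerates silence before treating the link as stalled.
    HeartbeatMonitor(Clock::time_point start, Clock::duration timeout)
        : last_ack_(start), timeout_(timeout) {}

    // Call whenever a heartbeat ack is received from the client.
    void on_ack(Clock::time_point now) { last_ack_ = now; }

    // True when no ack has been seen for longer than the timeout.
    bool is_stalled(Clock::time_point now) const {
        return now - last_ack_ > timeout_;
    }

private:
    Clock::time_point last_ack_;
    Clock::duration timeout_;
};
```

With one monitor per client, calling on_ack() from the receive path and is_stalled() from the 1-second heartbeat timer would let the server log the moment a client goes quiet, which can then be lined up against the Wireshark capture.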

When the problem occurs, the log file contains the expected commands, but the Wireshark dump contains nothing. The image below shows the traffic with one client. The red line marks where the traffic stops, even though the server continues to call ::send() successfully.

[Image: TCP traffic when the problem kicks in]

After about 4 seconds, the client times out and closes the connection. It sends a quit command, and the server receives it normally.

What puzzles me even more is that the packet containing the quit command is not acknowledged with a TCP ACK. It is as if the TCP connection were completely jammed on the sending end. The retransmission is an effect of that jam, but even the TCP SYN that tries to establish a new connection is not processed correctly and does not get a simple TCP ACK.

After about 30 seconds, the problem disappears: the SYN packet is finally accepted and communication continues over the new connection.

This was tested on various Windows versions. During the tests, a remote desktop session was in use and it was never disconnected by the same problem; it stays connected for hours without any trouble. When the client passes through a wireless bridge, the problem is more frequent. We ran Wireshark on both sides of the wireless endpoint and saw no retransmission or packet loss that could explain the higher disconnection rate.

When many clients are connected to the same bridge, they do not fail at the same time, only one at a time, so wireless noise does not seem to be an explanation. We can see some retransmissions in the Wireshark dump, but communication continues as usual after them, and there is no retransmission before the problem happens. An access point is connected to the server's switch. The client PC and server PC do not use wireless network cards.

For a long time, we thought the occasional disconnections were caused by the network, but more and more installations are wireless and the disconnections are now so frequent that they cause problems for the users.

We tried with and without the Windows firewall enabled. We added the port exception even when the firewall is disabled. Neither the client nor the server has antivirus software.

PRouleau asked Jul 16 '12 13:07


1 Answer

The TCP Keepalive HOWTO may help: http://tldp.org/HOWTO/html_single/TCP-Keepalive-HOWTO/#usingkeepalive
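As a rough illustration of what the HOWTO covers, the sketch below turns on keepalive for an existing TCP socket with setsockopt() and SO_KEEPALIVE. The helper names (enable_keepalive, keepalive_enabled, tune_keepalive) are hypothetical. The TCP_KEEPIDLE / TCP_KEEPINTVL / TCP_KEEPCNT tuning is compiled only on Linux, which is what the HOWTO targets, and on Windows the application would still need WSAStartup() before creating sockets. Whether keepalive changes the behaviour described in the question is untested here.

```cpp
#include <cassert>
#ifdef _WIN32
  #include <winsock2.h>
  using socket_t = SOCKET;
#else
  #include <netinet/in.h>
  #include <netinet/tcp.h>
  #include <sys/socket.h>
  #include <unistd.h>
  using socket_t = int;
#endif

// Hypothetical helper: ask the TCP stack to probe an idle connection.
// Returns true on success.
bool enable_keepalive(socket_t s) {
    int on = 1;
    // The cast keeps the call valid on Winsock, where optval is const char*.
    return ::setsockopt(s, SOL_SOCKET, SO_KEEPALIVE,
                        reinterpret_cast<const char*>(&on), sizeof(on)) == 0;
}

// Hypothetical helper: read the option back to confirm it is set.
bool keepalive_enabled(socket_t s) {
    int value = 0;
#ifdef _WIN32
    int len = sizeof(value);
#else
    socklen_t len = sizeof(value);
#endif
    if (::getsockopt(s, SOL_SOCKET, SO_KEEPALIVE,
                     reinterpret_cast<char*>(&value), &len) != 0) {
        return false;
    }
    return value != 0;
}

#if defined(__linux__)
// Linux-only tuning: send the first probe after idle_s seconds of silence,
// then one probe every interval_s seconds, and drop the connection after
// `count` unanswered probes.
bool tune_keepalive(int s, int idle_s, int interval_s, int count) {
    return ::setsockopt(s, IPPROTO_TCP, TCP_KEEPIDLE, &idle_s, sizeof(idle_s)) == 0
        && ::setsockopt(s, IPPROTO_TCP, TCP_KEEPINTVL, &interval_s, sizeof(interval_s)) == 0
        && ::setsockopt(s, IPPROTO_TCP, TCP_KEEPCNT, &count, sizeof(count)) == 0;
}
#endif
```

A server would typically call enable_keepalive() on each socket returned by accept(). Without tuning, the operating system's default idle time applies, which is usually far longer than the few seconds discussed in the question.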

enthusiasticgeek answered Oct 18 '22 06:10