I am working on a Windows (Microsoft Visual C++ 2005) application that uses several processes running on different hosts in an intranet.
The processes communicate with each other over TCP/IP. They can run on the same host or on different hosts, so the communication can be either local to one host or between hosts.
We currently have a bug that appears irregularly. The communication works for a while, then stops working, then works again for some time.
When the communication does not work, we get an error (apparently while a process was trying to send data). The call looks like this:
send(socket, (char *) data, (int) data_size, 0);
By inspecting the error code we get from
WSAGetLastError()
we see that it is error 10054. Here is what I found in the Microsoft documentation:
WSAECONNRESET
10054
Connection reset by peer.
An existing connection was forcibly closed by the remote host. This normally
results if the peer application on the remote host is suddenly stopped, the
host is rebooted, the host or remote network interface is disabled, or the
remote host uses a hard close (see setsockopt for more information on the
SO_LINGER option on the remote socket). This error may also result if a
connection was broken due to keep-alive activity detecting a failure while
one or more operations are in progress. Operations that were in progress
fail with WSAENETRESET. Subsequent operations fail with WSAECONNRESET.
So, as far as I understand, the connection was interrupted by the receiving process. In some cases this error is (AFAIK) correct: one process has terminated and is therefore not reachable. In other cases both the sender and receiver are running and logging activity, but they cannot communicate due to the above error (the error is reported in the logs).
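To make the distinction between these error codes concrete, here is a minimal sketch of how the result of WSAGetLastError() after a failed send() could be mapped to a recovery action. The numeric values are the documented Winsock codes, redefined locally so that the sketch compiles without <winsock2.h>; the SendAction enum and classify_send_error are hypothetical names introduced for illustration, and the policy itself is an assumption, not something prescribed by the Winsock documentation.

```cpp
// Documented Winsock error values, redefined here so the sketch is
// self-contained. Real code would use the WSAE* macros from <winsock2.h>.
enum WinsockError {
    kWsaEWouldBlock  = 10035,  // WSAEWOULDBLOCK
    kWsaENetReset    = 10052,  // WSAENETRESET
    kWsaEConnAborted = 10053,  // WSAECONNABORTED
    kWsaEConnReset   = 10054,  // WSAECONNRESET
    kWsaETimedOut    = 10060   // WSAETIMEDOUT
};

// Hypothetical recovery policy for a failed send().
enum SendAction {
    kRetryLater,  // socket is still usable, try the same send again later
    kReconnect,   // socket is dead: close it and open a new connection
    kGiveUp       // unexpected error: report it to the caller
};

// Maps an error code (as returned by WSAGetLastError()) to an action.
SendAction classify_send_error(int wsa_error) {
    switch (wsa_error) {
    case kWsaEWouldBlock:
        return kRetryLater;
    case kWsaENetReset:
    case kWsaEConnAborted:
    case kWsaEConnReset:
    case kWsaETimedOut:
        // The connection itself is gone; retrying on the same socket
        // would keep failing.
        return kReconnect;
    default:
        return kGiveUp;
    }
}
```

With this policy, error 10054 always leads to a reconnect rather than to a resend on the same socket.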
My questions.
Regarding the last question: the first solution we tried (actually, it is rather a workaround) was resending the message when the error occurs. Unfortunately, the same error occurs over and over again for a while (a few minutes). So this is not a solution.
At the moment we do not understand whether we have a software problem or a configuration issue: maybe we should check something in the Windows registry?
One hypothesis was that the OS runs out of ephemeral ports (in case connections are closed but the ports are not released because of TcpTimedWaitDelay). After analyzing this, we think there should be plenty of them: the problem occurs even when messages are not sent very frequently between processes. However, we are still not 100% sure that we can exclude this: can ephemeral ports get lost in some way?
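A rough back-of-the-envelope check of the ephemeral-port hypothesis can be sketched as follows. The function name is hypothetical, and the numbers used in the comment (ports 1025 to 5000, a TcpTimedWaitDelay of 240 seconds) are assumed defaults for older Windows versions such as XP and Server 2003; they should be verified against the registry of the machines involved.

```cpp
// Upper bound on the sustained rate of new outbound connections per second
// before every ephemeral port is parked in TIME_WAIT. This is a simplified
// model: it assumes each closed connection holds one local port for the
// full TIME_WAIT period.
double max_connection_rate(int first_port, int last_port,
                           int time_wait_seconds) {
    int ports = last_port - first_port + 1;
    return static_cast<double>(ports) / time_wait_seconds;
}

// Example with the assumed defaults:
//   max_connection_rate(1025, 5000, 240)
// gives 3976 / 240, i.e. between 16 and 17 connections per second.
```

If the processes open far fewer new connections than that per second, port exhaustion is an unlikely explanation under these assumptions.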
Another detail that might help is that sending and receiving occurs in each process concurrently in separate threads: are there any shared data structures in the TCP/IP libraries that might get corrupted?
What is also very strange is that the problem occurs irregularly: communication works OK for a few minutes, then it does not work for a few minutes, then it works again.
Thank you for any ideas and suggestions.
EDIT
Thanks for the hints confirming that the only possible explanation was a connection closed error. By further analysis of the problem, we found out that the server-side process of the connection had crashed / had been terminated and had been restarted. So there was a new server process running and listening on the correct port, but the client had not detected this and was still trying to use the old connection. We now have a mechanism to detect such situations and reset the connection on the client side.
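A mechanism of this kind could look roughly like the sketch below. It is not our actual code: the Transport struct and send_with_reconnect are hypothetical names, and the socket calls are hidden behind two callbacks so that the logic can be shown without Winsock. In a real program try_send would wrap send() and WSAGetLastError(), and reconnect would wrap closesocket(), socket() and connect().

```cpp
#include <cassert>
#include <functional>

const int kConnReset = 10054;  // numeric value of WSAECONNRESET

// Hypothetical transport hooks supplied by the application.
struct Transport {
    std::function<int()> try_send;    // 0 on success, otherwise an error code
    std::function<bool()> reconnect;  // drop the old socket, open a new one
};

// Tries to send once per attempt. On a connection reset it reconnects
// before the next attempt instead of resending on the dead socket.
// Returns true as soon as one send succeeds.
bool send_with_reconnect(Transport& t, int max_attempts) {
    assert(max_attempts > 0);
    for (int attempt = 0; attempt < max_attempts; ++attempt) {
        int err = t.try_send();
        if (err == 0) {
            return true;
        }
        if (err != kConnReset) {
            return false;  // other errors are not handled in this sketch
        }
        if (!t.reconnect()) {
            return false;  // server not reachable right now
        }
    }
    return false;
}
```

The key difference from our first workaround is that the resend happens on a fresh connection, which is why it can succeed once the restarted server is listening again.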
Manually resetting the network device and calling the Internet service provider (ISP) can help get the Internet connection back online. In more serious cases, the device may have to be repaired or replaced. Another possible reason for this error is that the user is using a proxy server to mask the computer's address.
10054 means: An existing connection was forcibly closed by the remote server or application. This normally results if the remote server/application is suddenly stopped, the host is rebooted, the host or remote network interface is disabled, or the remote host uses a hard close.
A different error, WSAEADDRINUSE (10048), occurs if an application attempts to bind a socket to an IP address/port that is already in use by an existing socket, by a socket that was not closed properly, or by one that is still in the process of closing.
Error 10054 means that the connection was closed by the remote side. So you cannot do anything in your program except accept that the connection is broken.
I was facing this problem for some days recently and found out that an Adobe Acrobat Reader update was the culprit. As soon as I completely uninstalled Adobe from the system, everything returned to normal.