
Irregular socket errors (10054) on Windows application

I am working on a Windows (Microsoft Visual C++ 2005) application that uses several processes running on different hosts in an intranet.

Processes communicate with each other using TCP/IP. Different processes can be on the same host or on different hosts (i.e. the communication can take place either within one host or between different hosts).

We currently have a bug that appears intermittently. The communication works for a while, then it stops working, then it works again for some time.

When the communication does not work, we get an error (apparently while a process was trying to send data). The call looks like this:

send(socket, (char *) data, (int) data_size, 0);

By inspecting the error code we get from

WSAGetLastError()

we see that it is error 10054. Here is what I found in the Microsoft documentation:

WSAECONNRESET
10054

Connection reset by peer.

An existing connection was forcibly closed by the remote host. This normally
results if the peer application on the remote host is suddenly stopped, the
host is rebooted, the host or remote network interface is disabled, or the
remote host uses a hard close (see setsockopt for more information on the
SO_LINGER option on the remote socket). This error may also result if a
connection was broken due to keep-alive activity detecting a failure while
one or more operations are in progress. Operations that were in progress
fail with WSAENETRESET. Subsequent operations fail with WSAECONNRESET.

So, as far as I understand, the connection was interrupted by the receiving process. In some cases this error is (AFAIK) correct: one process has terminated and is therefore not reachable. In other cases both the sender and receiver are running and logging activity, but they cannot communicate due to the above error (the error is reported in the logs).

My questions:

  • What does the SO_LINGER option mean?
  • What is a keep-alive activity and how can it break a connection?
  • How is it possible to avoid this problem or recover from it?
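For reference on the first two questions: both options named in the quoted documentation are ordinary socket options set with setsockopt. The sketch below is only an illustration, not code from the application. The helper names enable_hard_close and enable_keepalive are made up, and the #ifdef is there only so the snippet builds with either Winsock or BSD sockets. SO_LINGER with a zero timeout turns closing the socket into an abortive ("hard") close, which the peer sees as a reset. SO_KEEPALIVE asks the stack to probe an idle connection and to drop it if the peer stops answering.

```cpp
// Portable sketch: Winsock on Windows, BSD sockets elsewhere.
// Assumes WSAStartup() has already been called on Windows.
#ifdef _WIN32
#include <winsock2.h>
typedef SOCKET socket_t;
#else
#include <sys/socket.h>
typedef int socket_t;
#endif

// Hypothetical helper: request a hard close for this socket.
// With l_onoff = 1 and l_linger = 0, closing the socket discards unsent
// data and resets the connection instead of shutting it down gracefully.
bool enable_hard_close(socket_t s) {
    struct linger lg;
    lg.l_onoff = 1;   // lingering behaviour enabled
    lg.l_linger = 0;  // zero timeout: abort on close
    return setsockopt(s, SOL_SOCKET, SO_LINGER,
                      reinterpret_cast<const char*>(&lg),
                      static_cast<int>(sizeof(lg))) == 0;
}

// Hypothetical helper: turn on TCP keep-alive probing for this socket.
// If the peer stops answering the probes, the stack drops the connection
// and pending or later operations on the socket fail.
bool enable_keepalive(socket_t s) {
    int on = 1;
    return setsockopt(s, SOL_SOCKET, SO_KEEPALIVE,
                      reinterpret_cast<const char*>(&on),
                      static_cast<int>(sizeof(on))) == 0;
}
```

Keep-alive is off unless a socket asks for it, and with default system settings the first probe is only sent after a long idle period (two hours on Windows, as far as I know), so on its own it is unlikely to explain a connection that breaks every few minutes.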

Regarding the last question: the first thing we tried (really a workaround rather than a solution) was resending the message when the error occurs. Unfortunately, the same error then occurs over and over again for a while (a few minutes), so this does not solve the problem.

At the moment we do not know whether we have a software problem or a configuration issue: maybe we should check something in the Windows registry?

One hypothesis was that the OS runs out of ephemeral ports (in case connections are closed but ports are not released because of TcpTimedWaitDelay), but after analyzing this we think that there should be plenty of them: the problem occurs even when messages are not sent very frequently between processes. However, we are still not 100% sure that we can exclude this: can ephemeral ports get lost in some way?
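To put a number on the ephemeral-port hypothesis, here is a back-of-the-envelope calculation. It is a rough sketch, assuming that every closed outgoing connection keeps its ephemeral port busy for the full TcpTimedWaitDelay. The function name is made up, and the values in the usage note (ports 1025 to 5000, 240 seconds) are the defaults commonly cited for older Windows versions, not values read from the machines in question.

```cpp
// Rough upper bound on how many new outgoing connections per second a
// host can sustain before it runs out of ephemeral ports.
// Assumption: each closed connection blocks its port for the whole
// time_wait_seconds interval.
double max_connection_rate(int first_port, int last_port, int time_wait_seconds) {
    int ports = last_port - first_port + 1;  // size of the ephemeral range
    return static_cast<double>(ports) / time_wait_seconds;
}
```

With the assumed values, max_connection_rate(1025, 5000, 240) comes out at roughly 16.6 new connections per second. Processes that keep long-lived connections open and exchange messages only occasionally stay far below that. Port exhaustion would also show up when a new connection is opened, not as an error from send on a connection that is already established.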

Another detail that might help is that sending and receiving occur concurrently in each process, in separate threads: are there any shared data structures in the TCP/IP libraries that might get corrupted?

What is also very strange is that the problem occurs irregularly: communication works OK for a few minutes, then it does not work for a few minutes, then it works again.

Thank you for any ideas and suggestions.

EDIT

Thanks for the hints confirming that the only possible explanation was a connection closed error. By further analysis of the problem, we found out that the server-side process of the connection had crashed / had been terminated and had been restarted. So there was a new server process running and listening on the correct port, but the client had not detected this and was still trying to use the old connection. We now have a mechanism to detect such situations and reset the connection on the client side.
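For readers with the same problem, a client-side recovery of the kind described in the edit could look roughly like the sketch below. This is not the actual code from our application: the Client struct, the function names, and the choice of error codes treated as "connection lost" are all assumptions made for illustration. The idea is that a socket which has reported a reset is never reused. It is closed, a new connection is opened to the same address, and the message is sent once more.

```cpp
// Portable sketch: Winsock on Windows, BSD sockets elsewhere.
// Assumes WSAStartup() has already been called on Windows.
#ifdef _WIN32
#include <winsock2.h>
typedef SOCKET socket_t;
const socket_t kNoSocket = INVALID_SOCKET;
const int kSendFlags = 0;
void close_socket(socket_t s) { closesocket(s); }
bool is_connection_lost() {
    int e = WSAGetLastError();
    return e == WSAECONNRESET || e == WSAECONNABORTED || e == WSAENOTCONN;
}
#else
#include <cerrno>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
typedef int socket_t;
const socket_t kNoSocket = -1;
const int kSendFlags = MSG_NOSIGNAL;  // Linux: report EPIPE, no SIGPIPE
void close_socket(socket_t s) { close(s); }
bool is_connection_lost() {
    return errno == ECONNRESET || errno == EPIPE || errno == ENOTCONN;
}
#endif

// Hypothetical client-side connection state.
struct Client {
    socket_t sock;       // kNoSocket while disconnected
    sockaddr_in server;  // address the server process listens on
    int reconnects;      // number of connections opened so far
};

// Drop the stale socket, if any, and open a fresh connection.
bool reconnect(Client& c) {
    if (c.sock != kNoSocket) close_socket(c.sock);
    c.sock = socket(AF_INET, SOCK_STREAM, 0);
    if (c.sock == kNoSocket) return false;
    if (connect(c.sock, reinterpret_cast<const sockaddr*>(&c.server),
                static_cast<int>(sizeof(c.server))) != 0) {
        close_socket(c.sock);
        c.sock = kNoSocket;
        return false;
    }
    ++c.reconnects;
    return true;
}

// Send the whole buffer, looping over partial sends.
bool send_all(socket_t s, const char* data, int size) {
    int sent = 0;
    while (sent < size) {
        int n = static_cast<int>(send(s, data + sent, size - sent, kSendFlags));
        if (n <= 0) return false;  // error code is left for the caller
        sent += n;
    }
    return true;
}

// Send a message; if the connection turns out to be dead, reconnect
// once and send the whole message again on the new connection.
bool send_or_reconnect(Client& c, const char* data, int size) {
    if (c.sock == kNoSocket && !reconnect(c)) return false;
    if (send_all(c.sock, data, size)) return true;
    if (!is_connection_lost()) return false;
    return reconnect(c) && send_all(c.sock, data, size);
}
```

Two caveats. A successful send only means the data was handed to the local TCP stack, so a reset is often noticed only on a later send or recv, and messages sent shortly before the server died may be lost. Resending after a reconnect also means the application protocol has to cope with a message possibly arriving twice or not at all.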

asked Jun 12 '12 13:06 by Giorgio



2 Answers

That error means that the connection was closed by the remote side. There is nothing your program can do about it except accept that the connection is broken.
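A broken connection can also be noticed on the receiving side, which is a minimal sketch worth having next to the send case. The enum and the function name below are hypothetical, and a blocking socket is assumed. A recv result of 0 means the peer closed the connection in an orderly way, and a negative result means an error such as the reset discussed here. In both cases the socket is finished and has to be closed.

```cpp
// Portable sketch: Winsock on Windows, BSD sockets elsewhere.
// Assumes WSAStartup() has already been called on Windows.
#ifdef _WIN32
#include <winsock2.h>
typedef SOCKET socket_t;
#else
#include <sys/socket.h>
typedef int socket_t;
#endif

enum PeerState { PeerAlive, PeerClosed, PeerError };

// Hypothetical helper: read once from a blocking socket and report what
// the result says about the connection. *got receives the byte count.
PeerState read_some(socket_t s, char* buf, int cap, int* got) {
    int n = static_cast<int>(recv(s, buf, cap, 0));
    *got = n > 0 ? n : 0;
    if (n > 0) return PeerAlive;    // data arrived, connection is usable
    if (n == 0) return PeerClosed;  // orderly shutdown by the peer
    return PeerError;               // e.g. connection reset (10054 on Windows)
}
```

A receiving thread that sees PeerClosed or PeerError can mark the connection as dead, so that the sending thread reconnects instead of retrying on the old socket.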

answered Sep 19 '22 02:09 by rekire


I was facing this problem for some days recently and found out that an Adobe Acrobat Reader update was the culprit. As soon as you completely uninstall Adobe Reader from the system, everything returns to normal.

answered Sep 19 '22 02:09 by Alexander Galkin