
What can cause a spontaneous EPIPE error without either end calling close() or crashing?

I have an application that consists of two processes (let's call them A and B), connected to each other through Unix domain sockets. Most of the time it works fine, but some users report the following behavior:

  1. A sends a request to B. This works. A now starts reading the reply from B.
  2. B sends a reply to A. The corresponding write() call fails with EPIPE, and as a result B calls close() on the socket. However, A did not close() the socket, nor did it crash.
  3. A's read() call returns 0, indicating end-of-file. A thinks that B prematurely closed the connection.

Users have also reported variations of this behavior, e.g.:

  1. A sends a request to B. This works partially, but before the entire request is sent, A's write() call fails with EPIPE, and as a result A calls close() on the socket. However, B did not close() the socket, nor did it crash.
  2. B reads a partial request and then suddenly gets an EOF.

The problem is I cannot reproduce this behavior locally at all. I've tried OS X and Linux. The users are on a variety of systems, mostly OS X and Linux.

Things that I've already tried and considered:

  • Double close() bugs (close() is called twice on the same file descriptor): probably not, since that would result in EBADF errors, and I haven't seen any.
  • Increasing the maximum file descriptor limit. One user reported that this worked for him, the rest reported that it did not.
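
A caveat on the double close() point, offered as a sketch rather than a diagnosis: a second close() only fails with EBADF if the descriptor number is still unused at that moment. The kernel hands out the lowest free descriptor number, so if another socket was opened between the two calls, the stray close() can succeed silently and close that unrelated socket instead. The example below uses Python for brevity (os.close and os.write map directly onto the C calls); the function name and the scenario are hypothetical and are not taken from the application described above.

```python
import os
import socket


def stray_close_hits_reused_fd():
    """Return (fd_was_reused, outcome_of_write_on_surviving_end)."""
    # Open and close an unrelated descriptor, but remember its number,
    # the way buggy code keeps a stale fd around after its first close().
    tmp_r, tmp_w = os.pipe()
    stale_fd = tmp_r
    os.close(tmp_r)  # first, legitimate close()

    # The lowest free number is handed out next, so one end of the new
    # socket pair normally receives the number that was just released.
    a, b = socket.socketpair()
    fd_a, fd_b = a.detach(), b.detach()  # work with raw descriptors
    reused = stale_fd in (fd_a, fd_b)

    try:
        os.close(stale_fd)  # buggy second close(): no EBADF if reused
    except OSError:
        pass  # EBADF: the number was not reused in this run

    survivor = fd_b if stale_fd == fd_a else fd_a
    try:
        os.write(survivor, b"reply")
        outcome = "ok"
    except BrokenPipeError:
        outcome = "EPIPE"  # the peer end was closed behind our back

    # Clean up whatever is still open.
    for fd in (tmp_w, fd_a, fd_b):
        if fd != stale_fd:
            os.close(fd)
    return reused, outcome
```

When the number is reused, neither close() reports an error, yet the surviving end of the socket pair fails on write, which resembles the symptom described above. In a multi-threaded process the window for this kind of reuse is much wider.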

What else could cause behavior like this? I know for certain that neither A nor B calls close() on the socket prematurely, and I know for certain that neither of them crashed, because both A and B were able to report the error. It is as if the kernel suddenly decided to pull the plug on the socket for some reason.

Hongli asked Feb 10 '10 10:02



2 Answers

Perhaps you could try strace as described in: http://modperlbook.org/html/6-9-1-Detecting-Aborted-Connections.html

I assume that your problem is related to the one described here: http://blog.netherlabs.nl/articles/2009/01/18/the-ultimate-so_linger-page-or-why-is-my-tcp-not-reliable

Unfortunately, I'm having a similar problem myself and couldn't fix it with the advice given there. However, perhaps the SO_LINGER approach will work for you.
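
For reference, a minimal sketch of how SO_LINGER can be switched on, written in Python for brevity; the equivalent C call is setsockopt() with a struct linger. The helper names enable_linger and read_linger are hypothetical. The linked article is written about TCP, so whether this has any effect on the Unix domain sockets in the question is an assumption that would need testing.

```python
import socket
import struct


def enable_linger(sock, seconds):
    """Ask close() to wait up to `seconds` for unsent data to drain."""
    # struct linger is two C ints: l_onoff and l_linger.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                    struct.pack("ii", 1, seconds))


def read_linger(sock):
    """Return (l_onoff, l_linger) as currently set on the socket."""
    raw = sock.getsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                          struct.calcsize("ii"))
    return struct.unpack("ii", raw)
```

Calling read_linger() after enable_linger() is a cheap way to confirm that the option actually took effect on a given platform.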

user206268 answered Sep 30 '22 15:09



  • shutdown() may have been called on one of the socket endpoints.

  • If either side may fork and execute a child process, ensure that the FD_CLOEXEC (close-on-exec) flag is set on the socket file descriptor if you did not intend for it to be inherited by the child. Otherwise the child process could (accidentally or otherwise) be manipulating your socket connection.
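
Both points can be illustrated with a short sketch, written in Python for brevity (the calls map onto shutdown(), dup(), and fcntl() in C). The function names are hypothetical, and the duplicated descriptor merely stands in for a copy of the socket held by a child process or by unrelated code. The first function shows that shutdown() on any copy of the descriptor affects the shared socket: the original writer then gets EPIPE and the peer reads end-of-file, even though nobody called close() on the original. The second sets FD_CLOEXEC so that an exec()'d child does not receive a copy in the first place.

```python
import fcntl
import socket


def shutdown_via_duplicate():
    """Show shutdown() on a duplicate descriptor breaking the original."""
    a, b = socket.socketpair()      # connected Unix domain stream sockets
    stray = b.dup()                 # stands in for a copy held elsewhere
    stray.shutdown(socket.SHUT_WR)  # acts on the socket, not just this fd
    stray.close()
    try:
        b.send(b"reply")
        b_result = "ok"
    except BrokenPipeError:
        b_result = "EPIPE"
    a_read = a.recv(16)             # b"" means end-of-file
    a.close()
    b.close()
    return b_result, a_read


def set_cloexec(sock):
    """Set FD_CLOEXEC so exec()'d children do not inherit the socket."""
    flags = fcntl.fcntl(sock.fileno(), fcntl.F_GETFD)
    fcntl.fcntl(sock.fileno(), fcntl.F_SETFD, flags | fcntl.FD_CLOEXEC)
```

Note that Python 3.4 and later create descriptors as non-inheritable by default, so set_cloexec() mainly mirrors what a C program would need to do explicitly, for example with fcntl() after socket() or accept().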

mark4o answered Sep 30 '22 15:09
