I have a program which:
At some point, thread (4) does a fork/exec to run another program which should connect to the socket that thread (2) is listening to. Occasionally this fails or takes an unreasonably long time, and it's extremely difficult to diagnose. If I strace the system, it appears that the fork/exec has worked, the accept has happened, and the new thread (4) has been created, but nothing happens in that thread (using strace -ff, the file for the relevant pid is empty).
Any ideas?
I came to the conclusion that it was probably this phenomenon:
http://kerneltrap.org/mailarchive/linux-kernel/2008/8/15/2950234/thread
as the bug is difficult to trigger on our development systems but is generally reported by users running on large shared machines; also, the forked application starts a JVM, which itself creates a lot of threads. The problem is also associated with the machine being heavily loaded and with extensive memory usage (we have a machine with 128 GB of RAM, and processes may be 10-100 GB in size).
I've been reading the O'Reilly pthreads book, which explains pthread_atfork(), and suggests the use of a "surrogate parent" process forked from the main process at startup from which subprocesses are run. It also suggests the use of a pre-created thread pool. Both of these seem like good ideas, so I'm going to implement at least one of them.
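For anyone who wants to see the shape of the "surrogate parent" idea, here is a minimal sketch in Python rather than C, just to keep it short. It is not the book's code. The names `SurrogateParent`, `_serve` and `spawn`, and the JSON-line protocol over a socketpair, are my own assumptions for illustration. The point is that the helper is forked at startup, while the main process is still small and single-threaded, so later fork/exec calls happen in the helper and never have to duplicate the big multi-threaded process.

```python
import json
import os
import signal
import socket
import threading


def _serve(sock):
    """Surrogate loop: read one JSON argv per line, fork/exec it, reply with the pid."""
    # Let the kernel reap finished children so no zombies accumulate.
    signal.signal(signal.SIGCHLD, signal.SIG_IGN)
    rfile = sock.makefile("r")
    wfile = sock.makefile("w")
    for line in rfile:
        argv = json.loads(line)
        child = os.fork()
        if child == 0:
            # Restore the default so the new program sees normal SIGCHLD behaviour.
            signal.signal(signal.SIGCHLD, signal.SIG_DFL)
            try:
                os.execvp(argv[0], argv)
            finally:
                os._exit(127)  # reached only if exec failed
        wfile.write(json.dumps({"pid": child}) + "\n")
        wfile.flush()


class SurrogateParent:
    """Hypothetical helper: create it at startup, before any threads exist."""

    def __init__(self):
        main_end, helper_end = socket.socketpair()
        self.pid = os.fork()
        if self.pid == 0:
            main_end.close()
            try:
                _serve(helper_end)
            finally:
                os._exit(0)
        helper_end.close()
        self._sock = main_end
        self._rfile = main_end.makefile("r")
        self._wfile = main_end.makefile("w")
        self._lock = threading.Lock()  # one request/reply at a time

    def spawn(self, argv):
        """Ask the surrogate to fork/exec argv; returns the new process id."""
        with self._lock:
            self._wfile.write(json.dumps(list(argv)) + "\n")
            self._wfile.flush()
            return json.loads(self._rfile.readline())["pid"]

    def close(self):
        """Close the channel so the surrogate sees EOF and exits, then reap it."""
        self._rfile.close()
        self._wfile.close()
        self._sock.close()
        os.waitpid(self.pid, 0)
```

Usage would be something like `helper = SurrogateParent()` as the first thing in `main`, then `helper.spawn(["java", "-jar", "app.jar"])` from whichever thread needs the subprocess (the command line here is a placeholder). This sketch only reports the pid: it does not pass back exit statuses or exec failures, which a real implementation would probably want.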
It looks like a deadlock condition. Look at the blocking calls, such as accept(); the problem is most likely there.