
Large number of multiprocessing.Process instances causing a deadlock

Context

I need to run a multiprocessing.Process inside a multiprocessing.pool.ThreadPool. It may seem odd at first, but it is the only way I have found to deal with the segfaults that can occur because I am using a C++ shared library. If a segfault happens, the process is killed, and I can check process.exitcode and handle it.
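For reference, here is a minimal sketch of that isolation pattern. The C++ call is replaced by a hypothetical might_segfault() that simulates a crash by sending itself SIGSEGV, and the "fork" start method is assumed (POSIX only). A negative process.exitcode of -N means the child was killed by signal N, so a segfault shows up as -signal.SIGSEGV (-11 on Linux and macOS).

```python
import multiprocessing
import os
import signal


def might_segfault(value):
    # Hypothetical stand-in for the C++ library call.
    # Odd values simulate a crash by sending SIGSEGV to the child itself.
    if value % 2:
        os.kill(os.getpid(), signal.SIGSEGV)
    # A normal return makes the child exit with code 0.


def run_isolated(value):
    """Run might_segfault(value) in a child process and return its exit code."""
    ctx = multiprocessing.get_context("fork")  # assumption: POSIX "fork" start method
    process = ctx.Process(target=might_segfault, args=(value,))
    process.start()
    process.join()
    # exitcode is 0 on success, -N if the child was killed by signal N.
    return process.exitcode


if __name__ == "__main__":
    for v in range(4):
        code = run_isolated(v)
        if code is not None and code < 0:
            print("value %d: child killed by %s" % (v, signal.Signals(-code).name))
        else:
            print("value %d: exit code %s" % (v, code))
```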

Problem

After a while, a deadlock occurs when I try to join the process.

Here is a simplified version of my code:

import sys, time, multiprocessing
from multiprocessing.pool import ThreadPool

def main():
    # Launch 8 workers
    pool = ThreadPool(8)
    it = pool.imap(run, range(500))
    while True:
        try:
            it.next()
        except StopIteration:
            break

def run(value):
    # Each worker launches its own Process
    process = multiprocessing.Process(target=run_and_might_segfault, args=(value,))
    process.start()

    while process.is_alive():
        sys.stdout.write('.')
        sys.stdout.flush()
        time.sleep(0.1)

    # After a while this join never returns, because of a mysterious deadlock
    process.join()

    # Handle process.exitcode to log errors

def run_and_might_segfault(value):
    # Load a shared library and do stuff (could throw a C++ exception, segfault, ...)
    print(value)

if __name__ == '__main__':
    main()

And here is a possible output:

➜  ~ python m.py
..0
1
........8
.9
.......10
......11
........12
13
........14
........16
........................................................................................

As you can see, after a few iterations process.is_alive() stays True and the process never joins.

If I press Ctrl-C, I get this stack trace:

Traceback (most recent call last):
  File "/usr/local/Cellar/python3/3.5.1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/pool.py", line 680, in next
    item = self._items.popleft()
IndexError: pop from an empty deque

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "m.py", line 30, in <module>
    main()
  File "m.py", line 9, in main
    it.next()
  File "/usr/local/Cellar/python3/3.5.1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/pool.py", line 684, in next
    self._cond.wait(timeout)
  File "/usr/local/Cellar/python3/3.5.1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/threading.py", line 293, in wait
    waiter.acquire()
KeyboardInterrupt

Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/usr/local/Cellar/python3/3.5.1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/popen_fork.py", line 29, in poll
    pid, sts = os.waitpid(self.pid, flag)
KeyboardInterrupt

PS: I am using Python 3.5.2 on macOS.

Any help is appreciated, thanks.

Edit

I tried it with Python 2.7 and it works fine. Could this be a Python 3.5-only issue?

asked Oct 05 '16 22:10 by Hadhoke




1 Answer

The problem also reproduces on the latest build of CPython, Python 3.7.0a0 (default:4e2cce65e522, Oct 13 2016, 21:55:44).

If you attach to one of the stuck processes with gdb, you'll see that it's trying to acquire a lock in a sys.stdout.flush() call:

(gdb) py-list
 263                import traceback
 264                sys.stderr.write('Process %s:\n' % self.name)
 265                traceback.print_exc()
 266            finally:
 267                util.info('process exiting with exitcode %d' % exitcode)
>268                sys.stdout.flush()
 269                sys.stderr.flush()
 270
 271            return exitcode

The Python-level backtrace looks like this:

 (gdb) py-bt
 Traceback (most recent call first):
   File "/home/rpodolyaka/src/cpython/Lib/multiprocessing/process.py", line 268, in _bootstrap
     sys.stdout.flush()
   File "/home/rpodolyaka/src/cpython/Lib/multiprocessing/popen_fork.py", line 74, in _launch
     code = process_obj._bootstrap()
   File "/home/rpodolyaka/src/cpython/Lib/multiprocessing/popen_fork.py", line 20, in __init__
     self._launch(process_obj)
   File "/home/rpodolyaka/src/cpython/Lib/multiprocessing/context.py", line 277, in _Popen
     return Popen(process_obj)
   File "/home/rpodolyaka/src/cpython/Lib/multiprocessing/context.py", line 223, in _Popen
     return _default_context.get_context().Process._Popen(process_obj)
   File "/home/rpodolyaka/src/cpython/Lib/multiprocessing/process.py", line 105, in start
     self._popen = self._Popen(self)
   File "deadlock.py", line 17, in run
     process.start()
   File "/home/rpodolyaka/src/cpython/Lib/multiprocessing/pool.py", line 119, in worker
     result = (True, func(*args, **kwds))
   File "/home/rpodolyaka/src/cpython/Lib/threading.py", line 864, in run
     self._target(*self._args, **self._kwargs)
   File "/home/rpodolyaka/src/cpython/Lib/threading.py", line 916, in _bootstrap_inner
     self.run()
   File "/home/rpodolyaka/src/cpython/Lib/threading.py", line 884, in _bootstrap
     self._bootstrap_inner()

At the interpreter level it looks like this:

(gdb) frame 6

(gdb) list
287        return 0;
288    }
289    relax_locking = (_Py_Finalizing != NULL);
290    Py_BEGIN_ALLOW_THREADS
291    if (!relax_locking)
292        st = PyThread_acquire_lock(self->lock, 1);
293    else {
294        /* When finalizing, we don't want a deadlock to happen with daemon
295         * threads abruptly shut down while they owned the lock.
296         * Therefore, only wait for a grace period (1 s.). ... */

(gdb) p /x self->lock
$1 = 0xd25ce0

(gdb) p /x self->owner
$2 = 0x7f9bb2128700

Note that, from the point of view of this particular child process, the lock is still owned by one of the threads in the parent process (LWP 1105, i.e. thread 0x7f9bb2128700, the same value as self->owner above):

(gdb) info threads
  Id   Target Id         Frame
* 1    Thread 0x7f9bb5559440 (LWP 1102) "python" 0x00007f9bb5157577 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, 
    futex_word=0xe4d340) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  2    Thread 0x7f9bb312a700 (LWP 1103) "python" 0x00007f9bb4780253 in select () at ../sysdeps/unix/syscall-template.S:84
  3    Thread 0x7f9bb2929700 (LWP 1104) "python" 0x00007f9bb4780253 in select () at ../sysdeps/unix/syscall-template.S:84
  4    Thread 0x7f9bb2128700 (LWP 1105) "python" 0x00007f9bb4780253 in select () at ../sysdeps/unix/syscall-template.S:84
  5    Thread 0x7f9bb1927700 (LWP 1106) "python" 0x00007f9bb4780253 in select () at ../sysdeps/unix/syscall-template.S:84
  6    Thread 0x7f9bb1126700 (LWP 1107) "python" 0x00007f9bb4780253 in select () at ../sysdeps/unix/syscall-template.S:84
  7    Thread 0x7f9bb0925700 (LWP 1108) "python" 0x00007f9bb4780253 in select () at ../sysdeps/unix/syscall-template.S:84
  8    Thread 0x7f9b9bfff700 (LWP 1109) "python" 0x00007f9bb4780253 in select () at ../sysdeps/unix/syscall-template.S:84
  9    Thread 0x7f9b9b7fe700 (LWP 1110) "python" 0x00007f9bb4780253 in select () at ../sysdeps/unix/syscall-template.S:84
  10   Thread 0x7f9b9affd700 (LWP 1111) "python" 0x00007f9bb4780253 in select () at ../sysdeps/unix/syscall-template.S:84
  11   Thread 0x7f9b9a7fc700 (LWP 1112) "python" 0x00007f9bb5157577 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, 
    futex_word=0x7f9b80001ed0) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  12   Thread 0x7f9b99ffb700 (LWP 1113) "python" 0x00007f9bb5157577 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, 
    futex_word=0x7f9b84001bb0) at ../sysdeps/unix/sysv/linux/futex-internal.h:205

So it's indeed a deadlock. It happens because the original process writes to and flushes sys.stdout from multiple threads concurrently while also creating subprocesses. By the nature of the fork(2) system call, a child inherits a copy of the parent's memory, including any locks held at that moment, but only the thread that called fork() exists in the child. Some of the fork() calls must therefore have been performed while another thread held the sys.stdout lock. Even when the parent process eventually releases it, the children won't see that: each of them now has its own copy-on-write copy of the memory, and the thread that owned the lock does not exist there to release it.
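The same mechanism can be reproduced without multiprocessing or sys.stdout. In this minimal sketch (POSIX only; all names are illustrative), a background thread holds a plain threading.Lock while the main thread calls os.fork(). The parent then releases the lock, but the child's copy stays locked forever; the acquire() in the child only returns because a timeout is passed. Python 3.12 and later also emit a DeprecationWarning when os.fork() is called from a process they detect as multi-threaded, for exactly this reason.

```python
import os
import threading


def demo(timeout=1.0):
    """Return True if the forked child finds the lock still held in its copy."""
    lock = threading.Lock()
    holding = threading.Event()
    may_release = threading.Event()

    def holder():
        # Plays the role of a pool thread that is inside sys.stdout.flush().
        with lock:
            holding.set()
            may_release.wait()

    t = threading.Thread(target=holder)
    t.start()
    holding.wait()  # guarantee the lock is held when fork() happens

    pid = os.fork()
    if pid == 0:
        # Child: only the forking thread was copied, so nothing here will
        # ever release the lock. The timeout just keeps the demo from hanging.
        got_it = lock.acquire(timeout=timeout)
        os._exit(1 if got_it else 0)

    # Parent: releasing the lock here does not affect the child's copy.
    may_release.set()
    t.join()
    _, status = os.waitpid(pid, 0)
    return os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0


if __name__ == "__main__":
    print("child saw a permanently held lock:", demo())
```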

Thus, you need to be very careful when mixing multithreading with multiprocessing, and make sure all locks are released before fork() if they are going to be used in the child processes.
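One way to enforce that for a lock you control is os.register_at_fork() (Python 3.7+, POSIX): acquire the lock in the before hook and release it in both after_in_parent and after_in_child, so fork() can only proceed while no other thread holds it. The lock inside sys.stdout cannot be registered this way, but if every write and flush to sys.stdout in the parent is done while holding the same lock (io_lock here, a hypothetical name), and nothing else writes to it, the stdout lock should also be free whenever fork() runs. The "fork" start method of multiprocessing calls os.fork(), so the hooks apply to Process.start() too. This is a sketch under those assumptions, not a complete fix.

```python
import os
import threading
import time

# Hypothetical application-level lock guarding all writes to shared streams.
io_lock = threading.Lock()

# Hold io_lock across every fork: fork() waits until no other thread owns it,
# then both the parent and the child release their own copy afterwards.
os.register_at_fork(
    before=io_lock.acquire,
    after_in_parent=io_lock.release,
    after_in_child=io_lock.release,
)


def demo():
    """Return True if a child forked while io_lock was busy can still acquire it."""
    started = threading.Event()

    def busy_writer():
        with io_lock:  # e.g. a thread writing progress dots to sys.stdout
            started.set()
            time.sleep(0.2)

    t = threading.Thread(target=busy_writer)
    t.start()
    started.wait()  # io_lock is held when we call fork()

    pid = os.fork()  # blocks in the "before" hook until busy_writer is done
    if pid == 0:
        ok = io_lock.acquire(timeout=2)
        os._exit(0 if ok else 1)

    t.join()
    _, status = os.waitpid(pid, 0)
    return os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0


if __name__ == "__main__":
    print("child could acquire io_lock:", demo())
```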

It's very similar to the problem described in http://bugs.python.org/issue6721.

Note that if you remove the interactions with sys.stdout from your snippet, it works correctly.
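Applied to the snippet from the question, that could look like the sketch below: the pool threads only start and join the child processes, and all reporting happens in the main thread after the pool has finished, so no parent thread is inside sys.stdout while a fork happens. The "fork" context is requested explicitly to mirror the question's setup (macOS defaults to "spawn" since Python 3.8), and run_and_might_segfault is only a placeholder for the C++ call.

```python
import multiprocessing
from multiprocessing.pool import ThreadPool


def run_and_might_segfault(value):
    # Placeholder for loading the shared library and doing the real work.
    pass


def run(value):
    # No sys.stdout writes here: pool threads only start and join processes.
    ctx = multiprocessing.get_context("fork")
    process = ctx.Process(target=run_and_might_segfault, args=(value,))
    process.start()
    process.join()
    return value, process.exitcode


def main(n=500):
    with ThreadPool(8) as pool:
        results = pool.map(run, range(n))
    # Report only after all children have been created and joined.
    failures = [(v, code) for v, code in results if code != 0]
    print("%d runs, %d failures" % (len(results), len(failures)))
    return results


if __name__ == "__main__":
    main()
```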

answered Oct 02 '22 23:10 by Roman Podoliaka