
multiprocessing.Pool hangs indefinitely after close/join

I have an intermittent production issue where a multiprocessing.Pool gets stuck and never returns from join.

I've managed to reduce the problem to this small example, which hangs somewhat reliably.

Working example:

#!/usr/bin/env python3
import os
import time
import multiprocessing.pool

def run_task(i):
    print(f'[{os.getpid()}] task({i}) complete')

if __name__ == '__main__':
    tasks = iter(range(10))
    processes = 4

    pool = multiprocessing.pool.Pool(processes=processes, maxtasksperchild=1)
    running = []
    while True:
        try:
            running = [ f for f in running if not f.ready() ]
            avail = processes - len(running)
            if avail:
                for _ in range(avail):
                    i = next(tasks)
                    print(f'[{os.getpid()}] add task({i})')
                    future = pool.apply_async(run_task, ( i, ))
                    running.append(future)
            else:
                time.sleep(0.1)
        except StopIteration:
            print(f'[{os.getpid()}] all tasks scheduled')
            break

    print(f'[{os.getpid()}] close and join pool')
    pool.close()
    pool.join()
    print(f'[{os.getpid()}] all done')

Presumably the issue is one of timing, as the failure is intermittent. As such, I have to run it in a loop to get it to hang (although in my experience it hangs within the first few iterations).

for i in {1..100}; do ./test.py; done   

The output when it hangs:

[15243] add task(0)
[15243] add task(1)
[15243] add task(2)
[15243] add task(3)
[15244] task(0) complete
[15245] task(1) complete
[15246] task(2) complete
[15247] task(3) complete
[15243] add task(4)
[15243] add task(5)
[15251] task(4) complete
[15243] add task(6)
[15243] add task(7)
[15252] task(5) complete
[15253] task(6) complete
[15243] add task(8)
[15243] add task(9)
[15243] all tasks scheduled
[15255] task(8) complete
[15256] task(9) complete
[15243] close and join pool     <-- hangs here indefinitely

gdb backtrace of the main process:

#0  0x00007fb132b7c6c2 in __GI___waitpid (pid=22857, stat_loc=0x7fff8ef55d5c, options=0) at ../sysdeps/unix/sysv/linux/waitpid.c:30
#1  0x00000000005d10e5 in os_waitpid_impl (module=<optimised out>, options=0, pid=22857) at ../Modules/posixmodule.c:6941
#2  os_waitpid.lto_priv () at ../Modules/clinic/posixmodule.c.h:2995
#3  0x000000000050a84f in _PyCFunction_FastCallDict (kwargs=<optimised out>, nargs=<optimised out>, args=<optimised out>, func_obj=0x7fb132fea0d8) at ../Objects/methodobject.c:234
#4  _PyCFunction_FastCallKeywords (kwnames=<optimised out>, nargs=<optimised out>, stack=<optimised out>, func=<optimised out>) at ../Objects/methodobject.c:294
#5  call_function.lto_priv () at ../Python/ceval.c:4851

gdb backtrace of the child process:

#0  0x00007fb1328896d6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x1c68e40) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
#1  do_futex_wait (sem=sem@entry=0x1c68e40, abstime=0x0) at sem_waitcommon.c:111
#2  0x00007fb1328897c8 in __new_sem_wait_slow (sem=0x1c68e40, abstime=0x0) at sem_waitcommon.c:181
#3  0x00000000005ab535 in PyThread_acquire_lock_timed (intr_flag=<optimised out>, microseconds=<optimised out>, lock=<optimised out>) at ../Python/thread_pthread.h:386
#4  PyThread_acquire_lock () at ../Python/thread_pthread.h:595
#5  0x0000000000446bf1 in _enter_buffered_busy (self=self@entry=0x7fb13307aa98) at ../Modules/_io/bufferedio.c:292
#6  0x00000000004ce743 in buffered_flush.lto_priv () at ../Python/thread_pthread.h:416

Implementation notes:

Scheduling tasks only when a worker is available:

Each individual task's priority can change while it's waiting to be executed, so I cannot just enqueue all the tasks at the beginning.

Hence the running list and the AsyncResult.ready() check to determine whether I can schedule another task.

maxtasksperchild=1:

The tasks leak memory, so in order to reclaim the memory lost after each task has run, I cheat and use maxtasksperchild=1.


Observations:

sleeping vs busy waiting:

Interestingly, if I change the time.sleep(0.1) to a busy wait, the hang goes away.

wait = time.time() + 0.1
while time.time() < wait:
    pass

Is it possible that a signal from a child process is missed during the parent's sleep?

maxtasksperchild=1:

If I reuse the original child processes, the hang goes away.
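For reference, reusing the workers simply means constructing the pool without the maxtasksperchild argument, so each child process stays alive and handles multiple tasks:

# same pool as in the example above, but workers are reused
# instead of being replaced after every task
pool = multiprocessing.pool.Pool(processes=processes)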


So it seems to be some interplay between the processes being destroyed after each task completes and the parent sleeping.

As a quick fix in production I've changed the sleep to a busy wait, but it feels like an ugly hack, and I'd like to understand exactly what is causing the hang.

  • Why does the pool never return from join?
  • Why does busy waiting instead of sleeping "fix" the problem?
  • Why does reusing processes instead of creating new ones each time "fix" the problem?
asked Nov 13 '19 by Steve Lorimer


1 Answer

I think the issue is with the exception; technically it should not be there, and it might already be fixed in later versions of Python.

[15243] add task(4)
[15243] add task(5)
[15251] task(4) complete
[15243] add task(6)
[15243] add task(7)
[15252] task(5) complete
[15253] task(6) complete
[15243] add task(8)
[15243] add task(9)
[15243] all tasks scheduled <-- exception raised here, but task(7) (presumably pid [15254]) has not completed
[15255] task(8) complete
[15256] task(9) complete
[15243] close and join pool     <-- hangs here indefinitely

Something happens at the point the exception is raised which might leave task(7) in a weird state. apply_async allows callbacks, which means that Python 3.6 might be creating those threads in an unstable manner.

A busy wait means your main process does not sleep and might be faster at handling this. Check whether increasing the wait time or using apply() makes a difference.
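If it helps to experiment, here is a rough sketch of both variations (the 1-second value is arbitrary, and note that apply() blocks until each task finishes, so it serializes the work):

# Variation 1: a longer sleep in the scheduling loop
time.sleep(1.0)                    # instead of time.sleep(0.1)

# Variation 2: blocking apply() instead of apply_async()
# the parent waits for each task to finish, so no running list is needed
for i in range(10):
    pool.apply(run_task, (i,))
pool.close()
pool.join()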

I am not sure why reusing the processes "fixes" the problem, but it might just be that access time is faster and easier to handle.

answered Oct 03 '22 by ghost_duke