I have an intermittent production issue where a multiprocessing.Pool gets stuck and never returns from join().
I've managed to reduce the problem to this small example, which hangs somewhat reliably.
#!/usr/bin/env python3
import os
import time
import multiprocessing.pool

def run_task(i):
    print(f'[{os.getpid()}] task({i}) complete')

if __name__ == '__main__':
    tasks = iter(range(10))
    processes = 4
    pool = multiprocessing.pool.Pool(processes=processes, maxtasksperchild=1)
    running = []
    while True:
        try:
            running = [ f for f in running if not f.ready() ]
            avail = processes - len(running)
            if avail:
                for _ in range(avail):
                    i = next(tasks)
                    print(f'[{os.getpid()}] add task({i})')
                    future = pool.apply_async(run_task, ( i, ))
                    running.append(future)
            else:
                time.sleep(0.1)
        except StopIteration:
            print(f'[{os.getpid()}] all tasks scheduled')
            break
    print(f'[{os.getpid()}] close and join pool')
    pool.close()
    pool.join()
    print(f'[{os.getpid()}] all done')
Presumably the issue is one of timing, as the failure is intermittent. As such, I have to run the script in a loop to get it to hang (although in my experience it usually hangs within the first few iterations).
for i in {1..100}; do ./test.py; done
The output when it hangs:
[15243] add task(0)
[15243] add task(1)
[15243] add task(2)
[15243] add task(3)
[15244] task(0) complete
[15245] task(1) complete
[15246] task(2) complete
[15247] task(3) complete
[15243] add task(4)
[15243] add task(5)
[15251] task(4) complete
[15243] add task(6)
[15243] add task(7)
[15252] task(5) complete
[15253] task(6) complete
[15243] add task(8)
[15243] add task(9)
[15243] all tasks scheduled
[15255] task(8) complete
[15256] task(9) complete
[15243] close and join pool <-- hangs here indefinitely
gdb backtrace of the main process:
#0 0x00007fb132b7c6c2 in __GI___waitpid (pid=22857, stat_loc=0x7fff8ef55d5c, options=0) at ../sysdeps/unix/sysv/linux/waitpid.c:30
#1 0x00000000005d10e5 in os_waitpid_impl (module=<optimised out>, options=0, pid=22857) at ../Modules/posixmodule.c:6941
#2 os_waitpid.lto_priv () at ../Modules/clinic/posixmodule.c.h:2995
#3 0x000000000050a84f in _PyCFunction_FastCallDict (kwargs=<optimised out>, nargs=<optimised out>, args=<optimised out>, func_obj=0x7fb132fea0d8) at ../Objects/methodobject.c:234
#4 _PyCFunction_FastCallKeywords (kwnames=<optimised out>, nargs=<optimised out>, stack=<optimised out>, func=<optimised out>) at ../Objects/methodobject.c:294
#5 call_function.lto_priv () at ../Python/ceval.c:4851
gdb backtrace of the child process:
#0 0x00007fb1328896d6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x1c68e40) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
#1 do_futex_wait (sem=sem@entry=0x1c68e40, abstime=0x0) at sem_waitcommon.c:111
#2 0x00007fb1328897c8 in __new_sem_wait_slow (sem=0x1c68e40, abstime=0x0) at sem_waitcommon.c:181
#3 0x00000000005ab535 in PyThread_acquire_lock_timed (intr_flag=<optimised out>, microseconds=<optimised out>, lock=<optimised out>) at ../Python/thread_pthread.h:386
#4 PyThread_acquire_lock () at ../Python/thread_pthread.h:595
#5 0x0000000000446bf1 in _enter_buffered_busy (self=self@entry=0x7fb13307aa98) at ../Modules/_io/bufferedio.c:292
#6 0x00000000004ce743 in buffered_flush.lto_priv () at ../Python/thread_pthread.h:416
Scheduling tasks only when a worker is available:
Each individual task's priority can change while it's waiting to be executed, so I cannot just enqueue all the tasks at the beginning. Hence the running list and the AsyncResult.ready() check to determine whether I can execute another task.
maxtasksperchild=1:
The tasks leak memory, so in order to reclaim the memory lost after each task has run, I cheat and use maxtasksperchild=1, so that each worker process is replaced after running a single task.
sleeping vs busy waiting:
Interestingly, if I change the time.sleep(0.1) to a busy wait, the hang goes away.

wait = time.time() + 0.1
while time.time() < wait:
    pass
Is it possible that a signal from a child process is missed during the parent's sleep?
reusing child processes:
If I drop maxtasksperchild=1 and reuse the original child processes, the hang goes away.
So it seems to be some interplay between the worker processes being destroyed after each task completes and the parent sleeping.
As a quick fix in production I've changed the sleep to a busy wait, but it feels like an ugly hack, and I'd like to understand what exactly is causing the hang.
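For reference, one alternative I've been considering is the sketch below (I haven't verified that it avoids this hang, and the on_done name is mine): drop the sleep/poll loop entirely and block on a semaphore that apply_async's callback releases from the parent's result-handler thread.

#!/usr/bin/env python3
import os
import threading
import multiprocessing.pool

def run_task(i):
    print(f'[{os.getpid()}] task({i}) complete')

if __name__ == '__main__':
    processes = 4
    # One slot per worker: acquire before submitting, release when a result arrives.
    slots = threading.BoundedSemaphore(processes)
    pool = multiprocessing.pool.Pool(processes=processes, maxtasksperchild=1)

    def on_done(result):
        # Runs in the pool's result-handler thread inside the parent process.
        slots.release()

    for i in range(10):
        # Blocks until a worker slot frees up; in the real code the next task
        # would be chosen by priority at this point rather than taken in order.
        slots.acquire()
        print(f'[{os.getpid()}] add task({i})')
        pool.apply_async(run_task, (i,), callback=on_done,
                         error_callback=lambda exc: slots.release())

    print(f'[{os.getpid()}] close and join pool')
    pool.close()
    pool.join()
    print(f'[{os.getpid()}] all done')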
I think the issue is with the StopIteration exception. Technically it should not be there, and this might already be fixed in later versions of Python.
[15243] add task(4)
[15243] add task(5)
[15251] task(4) complete
[15243] add task(6)
[15243] add task(7)
[15252] task(5) complete
[15253] task(6) complete
[15243] add task(8)
[15243] add task(9)
[15243] all tasks scheduled <-- StopIteration raised here, but task(7) (worker 15254) has not completed
[15255] task(8) complete
[15256] task(9) complete
[15243] close and join pool <-- hangs here indefinitely
Something happens at the point where that exception is raised which might leave task(7) in a weird state. apply_async allows callbacks, which means that Python 3.6 might be creating its handler threads in an unstable manner.
A busy wait means your main process never sleeps and might be faster at handling this. Check whether increasing the wait time or using apply() makes a difference.
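For example, a purely synchronous run for diagnosis might look like this sketch (diagnosis only: apply() blocks until each task returns, so tasks run one at a time and it is not a drop-in replacement for your scheduler):

#!/usr/bin/env python3
# Diagnostic sketch only: pool.apply() blocks until each task finishes, so at
# most one task is ever outstanding. If the hang disappears here, it would
# suggest the problem involves having several tasks in flight at once.
import os
import multiprocessing.pool

def run_task(i):
    print(f'[{os.getpid()}] task({i}) complete')

if __name__ == '__main__':
    pool = multiprocessing.pool.Pool(processes=4, maxtasksperchild=1)
    for i in range(10):
        print(f'[{os.getpid()}] add task({i})')
        pool.apply(run_task, (i,))
    pool.close()
    pool.join()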
I am not sure why reusing the child processes "fixes" the problem, but it might just be that access time is faster and easier to handle.
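As an illustration of avoiding the exception entirely, here is a rewrite of the question's scheduling loop using the two-argument form of next() so that StopIteration is never raised (my own sketch; I have not confirmed that it prevents the hang):

#!/usr/bin/env python3
# Sketch only: the scheduling loop restructured so StopIteration is never
# raised; next(tasks, None) returns a sentinel when the iterator is exhausted.
# Not a confirmed fix for the hang.
import os
import time
import multiprocessing.pool

def run_task(i):
    print(f'[{os.getpid()}] task({i}) complete')

if __name__ == '__main__':
    tasks = iter(range(10))
    processes = 4
    pool = multiprocessing.pool.Pool(processes=processes, maxtasksperchild=1)
    running = []
    exhausted = False
    while not exhausted:
        running = [f for f in running if not f.ready()]
        avail = processes - len(running)
        if not avail:
            time.sleep(0.1)
            continue
        for _ in range(avail):
            i = next(tasks, None)  # sentinel instead of StopIteration
            if i is None:
                print(f'[{os.getpid()}] all tasks scheduled')
                exhausted = True
                break
            print(f'[{os.getpid()}] add task({i})')
            running.append(pool.apply_async(run_task, (i,)))
    print(f'[{os.getpid()}] close and join pool')
    pool.close()
    pool.join()
    print(f'[{os.getpid()}] all done')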