Handling worker death in multiprocessing Pool

I have a simple server:

from multiprocessing import Pool, TimeoutError
import time
import os


if __name__ == '__main__':
    # start worker processes
    pool = Pool(processes=1)

    while True:
        # evaluate "os.getpid()" asynchronously
        res = pool.apply_async(os.getpid, ())  # runs in *only* one process
        try:
            print(res.get(timeout=1))             # prints the PID of that process
        except TimeoutError:
            print('worker timed out')

        time.sleep(5)

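    # NOTE: everything below is unreachable because of the infinite loop
    # above; it shows the intended graceful-shutdown sequence.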
    pool.close()
    print("Now the pool is closed and no longer available")
    pool.join()
    print("Done")

If I run this I get something like:

47292
47292

Then I kill 47292 while the server is running. A new worker process is started, but the output of the server is:

47292
47292
worker timed out
worker timed out
worker timed out

The pool is still trying to send requests to the old worker process.

I've done some work with catching signals in both the server and the workers, and I can get slightly better behaviour, but the server still seems to wait for dead children on shutdown (i.e. pool.join() never returns) after a worker is killed.

What is the proper way to handle workers dying?

Graceful shutdown of workers from a server process only seems to work if none of the workers has died.

(On Python 3.4.4 but happy to upgrade if that would help.)

UPDATE: Interestingly, this worker timeout problem does NOT happen if the pool is created with processes=2 and you kill one worker process, wait a few seconds and kill the other one. However, if you kill both worker processes in rapid succession then the "worker timed out" problem manifests itself again.

Perhaps related is that when the problem occurs, killing the server process will leave the worker processes running.

asked Aug 01 '17 by ivo
1 Answer

This behavior comes from the design of multiprocessing.Pool. When you kill a worker, you might kill the one holding the rlock of the call_queue. When that process dies while holding the lock, no other process can ever read from the call_queue again, which breaks the Pool: it can no longer communicate with its workers.
So there is actually no way to kill a worker and be sure your Pool will still be okay afterwards, because you might end up in a deadlock.
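To make the failure mode concrete, here is a minimal repro sketch. It pokes at Pool._pool, an internal CPython attribute, and uses SIGKILL, so it is POSIX-only and illustrative rather than something to rely on:

# Illustrative only: kills the single worker, which is typically blocked
# inside call_queue.get() while holding the queue's rlock, then shows that
# the respawned worker can never acquire the lock, so the task never runs.
import os
import signal
import time
from multiprocessing import Pool, TimeoutError

if __name__ == '__main__':
    pool = Pool(processes=1)
    worker_pid = pool._pool[0].pid   # _pool is an internal attribute
    os.kill(worker_pid, signal.SIGKILL)
    time.sleep(0.5)                  # let the pool respawn a worker

    res = pool.apply_async(os.getpid, ())
    try:
        print(res.get(timeout=1))
    except TimeoutError:
        print('pool is broken: the task never reaches the new worker')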

multiprocessing.Pool does not handle workers dying. You can try concurrent.futures.ProcessPoolExecutor instead (it has a slightly different API), which handles the failure of a process by default: when a process dies in a ProcessPoolExecutor, the entire executor is shut down and you get back a BrokenProcessPool error.
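For example, here is a minimal sketch of the server loop from the question rewritten on top of ProcessPoolExecutor; when the worker is killed, the executor raises BrokenProcessPool instead of hanging, so the server can detect the failure and create a fresh executor:

from concurrent.futures import ProcessPoolExecutor, TimeoutError
from concurrent.futures.process import BrokenProcessPool
import os
import time

if __name__ == '__main__':
    executor = ProcessPoolExecutor(max_workers=1)

    while True:
        try:
            # submit() raises BrokenProcessPool if the executor is already
            # broken; result() raises it if the worker dies mid-call.
            print(executor.submit(os.getpid).result(timeout=1))
        except BrokenProcessPool:
            print('worker died, recreating executor')
            executor = ProcessPoolExecutor(max_workers=1)
        except TimeoutError:
            print('worker timed out')
        time.sleep(5)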

Note that there are other deadlocks in this implementation that should be fixed in loky. (DISCLAIMER: I am a maintainer of this library.) Also, loky lets you resize an existing executor using a ReusablePoolExecutor and its method _resize. Let me know if you are interested; I can help you get started with this package. (I realize we still need to do a bit of work on the documentation... 0_0)
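As a starting point, here is a minimal sketch using loky's public helper get_reusable_executor (assuming loky is installed via pip install loky); the resizing described above is what this helper manages for you:

# Minimal sketch, assuming loky is installed (pip install loky).
from loky import get_reusable_executor
import os

executor = get_reusable_executor(max_workers=1)
print(executor.submit(os.getpid).result())

# Calling the helper again with a different max_workers resizes and
# reuses the same underlying executor instead of spawning a new pool.
executor = get_reusable_executor(max_workers=4)
print(executor.submit(os.getpid).result())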

answered by Thomas Moreau