I'm trying to understand the best practices with Python's multiprocessing.Pool object.
In my program I use Pool.imap very frequently. Normally, every time I start tasks in parallel, I create a new pool object and then close it when I'm done, roughly as in the sketch below.
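A minimal sketch of that pattern; the `square` task and the pool size are placeholders for illustration:

```python
import multiprocessing as mp

def square(x):
    # placeholder task; the real work function is more involved
    return x * x

def run_batch(items):
    # a fresh pool for every parallel section, closed once the batch is done
    pool = mp.Pool(processes=4)
    results = list(pool.imap(square, items))
    pool.close()
    return results

if __name__ == "__main__":
    print(run_batch(range(10)))
```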
I recently encountered a hang where the number of tasks submitted to the pool was smaller than the number of processes. The odd part was that it only occurred in my test pipeline, which ran a bunch of other things first; running the test standalone did not cause the hang. I assume it has to do with creating multiple pools.
I'd really like to find some resources to help me understand the best practices for using Python's multiprocessing. Specifically, I'm trying to understand the implications of creating several pool objects versus using only one.
When you create a Pool of worker processes, new processes are forked or spawned from the parent one. Starting them is reasonably quick, but it is not free, and the cost adds up if you do it over and over.
Therefore, unless you have a very good reason to do otherwise (for example, the Pool breaks because one of its workers dies unexpectedly), it is better to reuse the same Pool instance.
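A sketch of what that looks like, assuming the same trivial `square` task, with one long-lived pool reused for every imap call:

```python
import multiprocessing as mp

def square(x):
    return x * x

if __name__ == "__main__":
    # one long-lived pool, reused for every parallel section
    with mp.Pool(processes=4) as pool:
        first = list(pool.imap(square, range(10)))
        second = list(pool.imap(square, range(10, 20)))
    print(first)
    print(second)
```

Note that the `with` block calls terminate() on exit, so make sure every imap result has been consumed before leaving it.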
The reason for the hang is hard to tell without inspecting the code. You might not have cleaned up the previous instances properly (call close() or terminate(), and then always call join()), or you might have sent data that is too large through the Pool's channel, which usually ends in a deadlock, and so on.
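As an illustration of the teardown order, a sketch assuming the same trivial task:

```python
import multiprocessing as mp

def square(x):
    return x * x

if __name__ == "__main__":
    pool = mp.Pool(processes=4)
    try:
        results = list(pool.imap(square, range(10)))
    finally:
        pool.close()   # no more tasks will be submitted
        pool.join()    # wait for the workers to exit before moving on
    print(results)
```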
A pool certainly does not break if you submit fewer tasks than there are workers. The pool is designed precisely to decouple the number of tasks from the number of workers.
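For instance, a pool with more workers than tasks completes normally:

```python
import multiprocessing as mp

def square(x):
    return x * x

if __name__ == "__main__":
    # 8 workers, only 3 tasks: the idle workers simply stay idle
    with mp.Pool(processes=8) as pool:
        print(list(pool.imap(square, range(3))))
```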