Using python multiprocessing Pool in the terminal and in code modules for Django or Flask

Tags: python, django, flask, multiprocessing, pool

When using multiprocessing.Pool in python with the following code, there is some bizarre behavior.

from multiprocessing import Pool
p = Pool(3)
def f(x): return x
threads = [p.apply_async(f, [i]) for i in range(20)]
for t in threads:
    try: print(t.get(timeout=1))
    except Exception: pass

I get the following error three times (once for each worker in the pool), and it prints "3" through "19":

AttributeError: 'module' object has no attribute 'f'

The first three apply_async calls never return.

Meanwhile, if I try:

from multiprocessing import Pool
p = Pool(3)
def f(x): print(x)
p.map(f, range(20))

I get the AttributeError three times, the shell prints "6" through "19", and then it hangs and cannot be killed with Ctrl+C.

The multiprocessing docs have the following to say:

Functionality within this package requires that the main module be importable by the children.

What does this mean?

To clarify, I'm running code in the terminal to test functionality, but ultimately I want to be able to put this into modules of a web server. How do you properly use multiprocessing.Pool in the python terminal and in code modules?


asked Sep 22 '13 19:09

Zags

2 Answers

Caveat: Multiprocessing is the wrong tool to use in the context of web servers like Django and Flask. Instead, you should use a task framework like Celery or an infrastructure solution like Elastic Beanstalk Worker Environments. Using multiprocessing to spawn threads or processes is bad because it gives you no oversight or management of those threads/processes, and so you have to build your own failure detection logic, retry logic, etc. At that point, you are better served by using an off-the-shelf tool that is actually designed to handle asynchronous tasks, because it will give you these out of the box.

Understanding the docs

Functionality within this package requires that the main module be importable by the children.

What this means is that pools must be initialized after the definitions of functions to be run on them. Using pools within if __name__ == "__main__": blocks works if you are writing a standalone script, but this isn't possible in either larger code bases or server code (such as a Django or Flask project). So, if you're trying to use Pools in one of these, make sure to follow these guidelines, which are explained in the sections below:

Initialize Pools inside functions whenever possible. If you have to initialize them in the global scope, do so at the bottom of the module.
Do not call the methods of a Pool in the global scope.
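As a minimal sketch of the first guideline, the pool below is created inside a function, after the worker function is defined. The names square and parallel_squares are hypothetical, and the explicit "fork" start method is an assumption that matches the Unix behavior discussed here; it is not available on Windows.

```python
import multiprocessing as mp


def square(x):
    # Defined at module level, before any pool exists, so workers can find it.
    return x * x


def parallel_squares(values, workers=3):
    # Assumption: Unix-like OS. The "fork" start method does not exist on Windows.
    ctx = mp.get_context("fork")
    # The pool is created here, inside the function, and is torn down
    # when the with-block exits.
    with ctx.Pool(workers) as pool:
        return pool.map(square, values)


if __name__ == "__main__":
    print(parallel_squares(range(5)))
```

Because nothing touches the pool at import time, a module written this way can be imported from elsewhere without side effects.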

Alternatively, if you only need better parallelism on I/O (like database accesses or network calls), you can save yourself all this headache and use pools of threads instead of pools of processes. This involves ThreadPool, which went undocumented for a long time:

from multiprocessing.pool import ThreadPool

Its interface is exactly the same as that of Pool, but since it uses threads and not processes, it comes with none of the caveats that process pools have. The only downside is that you don't get true parallelism of code execution, just parallelism in blocking I/O.
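A rough sketch of what that looks like in practice. The function fake_io is a hypothetical stand-in for a blocking call, with time.sleep playing the role of the database or network wait:

```python
import time
from multiprocessing.pool import ThreadPool


def fake_io(x):
    # Hypothetical stand-in for a blocking call (database query, HTTP request).
    time.sleep(0.1)
    return x * 2


def run_io_tasks(values, workers=4):
    # Same map() interface as Pool, but the workers are threads in this process,
    # so nothing has to be pickled or re-imported.
    with ThreadPool(workers) as pool:
        return pool.map(fake_io, values)


if __name__ == "__main__":
    print(run_io_tasks(range(8)))
```

Since the workers share the parent's memory, the ordering of the function definition and the pool creation no longer matters here.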

Pools must be initialized after the definitions of functions to be run on them

The inscrutable text from the Python docs means that at the time the pool is defined, the surrounding module is imported by the worker processes in the pool. In the case of the Python terminal, this means all and only the code you have run so far.

So, any functions you want to use in the pool must be defined before the pool is initialized. This is true both of code in a module and code in the terminal. The following modifications of the code in the question will work fine:

from multiprocessing import Pool
def f(x): return x  # FIRST
p = Pool(3) # SECOND
threads = [p.apply_async(f, [i]) for i in range(20)]
for t in threads:
    try: print(t.get(timeout=1))
    except Exception: pass

from multiprocessing import Pool
def f(x): print(x)  # FIRST
p = Pool(3) # SECOND
p.map(f, range(20))

By fine, I mean fine on Unix. Windows has its own problems, which I'm not going into here.
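One caution about the snippets above: the blanket "except Exception: pass" hides exactly the kind of failure described in the question. A hedged variant that records which calls timed out might look like this. The helper name collect is hypothetical, and the "fork" start method is again an assumption for Unix-like systems:

```python
import multiprocessing as mp


def f(x):
    return x


def collect(n, workers=3, timeout=1):
    # Assumption: Unix-like OS ("fork" is unavailable on Windows).
    ctx = mp.get_context("fork")
    results, timed_out = [], []
    with ctx.Pool(workers) as pool:
        pending = [pool.apply_async(f, [i]) for i in range(n)]
        for i, task in enumerate(pending):
            try:
                results.append(task.get(timeout=timeout))
            except mp.TimeoutError:
                # Remember which call never returned instead of silently skipping it.
                timed_out.append(i)
    return results, timed_out


if __name__ == "__main__":
    print(collect(20))
```

Any exception other than a timeout is left to propagate, so a real bug in f shows up as a traceback rather than a missing number.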

Using pools in modules

But wait, there's more (to using pools in modules that you want to import elsewhere)!

If you define a pool inside a function, you have no problems. But if you are using a Pool object as a global variable in a module, it must be defined at the bottom of the module, not the top. Though this goes against most good code style, it is necessary for functionality. The only way to use a pool declared at the top of a module is to use it exclusively with functions imported from other modules, like so:

from multiprocessing import Pool
from other_module import f
p = Pool(3)
p.map(f, range(20))

Importing a pre-configured pool from another module is pretty horrific, as the import must come after whatever you want to run on it, like so:

### module.py ###
from multiprocessing import Pool
POOL = Pool(5)

### module2.py ###
def f(x):
    return x  # Some function
from module import POOL
POOL.map(f, range(10))

In addition, if you run anything on the pool in the global scope of a module that you are importing, the system hangs. For example, this doesn't work:

### module.py ###
from multiprocessing import Pool
def f(x): return x
p = Pool(1)
print(p.map(f, range(5)))

### module2.py ###
import module

This, however, does work, as long as nothing imports module2:

### module.py ###
from multiprocessing import Pool

def f(x): return x
p = Pool(1)
def run_pool(): print(p.map(f, range(5)))

### module2.py ###
import module
module.run_pool()

Now, the reasons behind this are only more bizarre, and likely related to the reason that the code in the question only raises an AttributeError once per worker and after that appears to execute properly. It also appears that pool workers (at least with some reliability) reload the code in the module after executing.
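One way to combine the guidelines, offered as a sketch rather than a recommendation: keep a module-level pool but create it lazily on first use, so nothing runs at import time and the pool is always created after the functions it runs. The names _POOL, get_pool and shutdown_pool are hypothetical, and the "fork" start method is assumed (Unix-like systems only).

```python
import multiprocessing as mp

_POOL = None  # created on first use, never at import time


def f(x):
    return x


def get_pool(workers=2):
    # Build the pool the first time it is needed; by then f is already defined.
    global _POOL
    if _POOL is None:
        # Assumption: Unix-like OS ("fork" is unavailable on Windows).
        _POOL = mp.get_context("fork").Pool(workers)
    return _POOL


def run_pool():
    return get_pool().map(f, range(5))


def shutdown_pool():
    # Let the workers finish and exit, then forget the pool.
    global _POOL
    if _POOL is not None:
        _POOL.close()
        _POOL.join()
        _POOL = None


if __name__ == "__main__":
    print(run_pool())
    shutdown_pool()
```

Another module can import this one and call run_pool() without caring about import order, since the pool does not exist until the first call.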

answered Oct 23 '22 10:10

Zags

The function you want to execute on a pool must already be defined when you create the pool.

This should work:

from multiprocessing import Pool
def f(x): print(x)
if __name__ == '__main__':
    p = Pool(3)
    p.map(f, range(20))

The reason is that (at least on systems having fork) when you create a pool the workers are created by forking the current process. So if the target function isn't already defined at that point, the worker won't be able to call it.

On Windows it's a bit different, as Windows doesn't have fork. Here new worker processes are started and the main module is imported. That's why on Windows it's important to protect the executing code with an if __name__ == '__main__' guard. Otherwise each new worker will re-execute the code and therefore spawn new processes infinitely, crashing the program (or the system).

answered Oct 23 '22 10:10

mata
