Python multiprocessing pool stuck

I'm trying to run some sample code for Python's multiprocessing.Pool that I found on the web. The code is:

from multiprocessing import Pool

def square(x):
    return x * x

if __name__ == '__main__':
    pool = Pool(processes=4)
    inputs = [0, 1, 2, 3, 4]
    outputs = pool.map(square, inputs)

But when I try to run it, it never finishes executing and I have to restart the kernel of my IPython Notebook. What's the problem?

asked Dec 04 '15 by Duccio Bertieri



1 Answer

As you may read from the answer pointed out by John in the comments, multiprocessing.Pool, in general, should not be expected to work well within an interactive interpreter. To understand why this is the case, consider how Pool does its job:

  • It forks Python workers, passing them the name of the current Python file.
  • The workers then essentially do import <this file> and listen for messages from the master.
  • The master sends function names along with function arguments to the workers via pickling. Note that the functions themselves cannot be sent, because the pickle protocol does not allow that (see the short sketch after this list).
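
A quick way to see this for yourself is to pickle a function: the resulting bytes contain only a reference such as __main__.square, not the function's code, so the receiving process must be able to import that name.

import pickle

def square(x):
    return x * x

# the dump records the module and qualified name of the function,
# not its bytecode; unpickling it in another process triggers an import
print(pickle.dumps(square))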

When you try to perform this procedure from an interactive prompt, there is no reasonable "current Python file" to pass to the children for importing. Moreover, the functions you defined in your interactive prompt are not part of any module (they are dynamically defined), and hence cannot be imported by the children from that nonexistent module. So your easiest bet is to simply avoid using multiprocessing within IPython. IPython parallel is so much better anyway :)
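
As a minimal sketch of the IPython parallel route (this assumes the ipyparallel package and an engine cluster already started with, e.g., ipcluster start -n 4):

import ipyparallel as ipp

def square(x):
    return x * x

rc = ipp.Client()       # connect to the running engines
dview = rc[:]           # direct view on all engines
print(dview.map_sync(square, [0, 1, 2, 3, 4]))   # [0, 1, 4, 9, 16]

Unlike multiprocessing, the engines can receive functions defined in the notebook itself, so this works from an ordinary cell.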


For completeness' sake I also checked what exactly happens in my particular case of an IPython 4 running under Python 2.7 on Windows 8 (where I can observe the interpreter getting stuck as well). Interestingly, the reason IPython gets stuck in the first place is not one of those mentioned above.

It turns out that multiprocessing checks whether __main__.__file__ is defined, and if not, sends sys.argv[0] as the "current filename" to the children. In the case of (my version of) IPython sys.argv[0] is equal to C:\Dev\Anaconda\lib\site-packages\ipykernel\__main__.py.
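
You can check what your own environment reports from a notebook cell (the exact path will of course differ):

import sys

# interactive sessions typically leave __main__.__file__ undefined...
print(getattr(sys.modules['__main__'], '__file__', '<not defined>'))

# ...so multiprocessing falls back to sys.argv[0]
print(sys.argv[0])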

Unfortunately, before starting up, the worker processes check whether the file they are about to import is already in their sys.modules. Line 488 of multiprocessing/forking.py says:

assert main_name not in sys.modules, main_name

When main_name is __main__ (as is the case with IPython's workers) this assertion fails and the workers fail to start. The same code, however, is "smart" enough to check whether the passed name is ipython, in which case it performs no such check and does not import anything.

Consequently, the problem of workers failing to start can be solved with an ugly hack: defining __main__.__file__ to be equal to 'ipython'. The following code works fine from an IPython cell:

import sys
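# make multiprocessing pass 'ipython' as the name of the main file;
# the workers are special-cased to skip importing it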
sys.modules['__main__'].__file__ = 'ipython'
from multiprocessing import Pool

pool = Pool(processes=4)
inputs = [0, 1, 2, 3, 4]
outputs = pool.map(abs, inputs)

Note that this example asks the workers to compute abs, a built-in function. It would fail (gracefully, with an exception) if you asked the workers to compute a function you defined within the notebook.

It turns out you can, in principle, go further with the hacking and have your functions sent over to the workers using some manual pickling of their code. You can find a pretty cool example of such a hack here.
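
To give a rough idea of the mechanics only (run this as a plain script rather than from a notebook, since the hypothetical _run_code helper must itself be importable by the workers, and it handles only simple functions without closures):

import marshal
import pickle
import types
from multiprocessing import Pool

def _run_code(payload):
    # rebuild the function from its raw code object and call it
    code_bytes, arg = pickle.loads(payload)
    func = types.FunctionType(marshal.loads(code_bytes), globals())
    return func(arg)

def square(x):
    return x * x

if __name__ == '__main__':
    # ship the code object itself instead of a "module.name" reference
    code_bytes = marshal.dumps(square.__code__)
    payloads = [pickle.dumps((code_bytes, x)) for x in [0, 1, 2, 3, 4]]
    pool = Pool(processes=4)
    print(pool.map(_run_code, payloads))   # [0, 1, 4, 9, 16]
    pool.close()
    pool.join()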

answered Oct 14 '22 by KT.