
Why is it important to protect the main loop when using joblib.Parallel?

The joblib docs contain the following warning:

Under Windows, it is important to protect the main loop of code to avoid recursive spawning of subprocesses when using joblib.Parallel. In other words, you should be writing code like this:

    import ....

    def function1(...):
        ...

    def function2(...):
        ...

    ...
    if __name__ == '__main__':
        # do stuff with imports and functions defined about
        ...

No code should run outside of the "if __name__ == '__main__'" blocks, only imports and definitions.

Initially, I assumed this was just to guard against the occasional odd case where a function passed to joblib.Parallel called the module recursively, which would make it generally good practice but often unnecessary. However, it doesn't make sense to me why this would only be a risk on Windows. Additionally, this answer seems to indicate that failing to protect the main loop made the code run several times slower than it otherwise would have for a very simple non-recursive problem.

Out of curiosity, I ran the super-simple example of an embarrassingly parallel loop from the joblib docs on a Windows machine without protecting the main loop. My terminal was spammed with the following error until I closed it:

ImportError: [joblib] Attempting to do parallel computing without protecting your import on a system that does not support forking. To use parallel-computing in a script, you must protect you main loop using "if __name__ == '__main__'". Please see the joblib documentation on Parallel for more information

My question is: what about the Windows implementation of joblib requires the main loop to be protected in every case?

Apologies if this is a super basic question. I am new to the world of parallelization, so I might just be missing some basic concepts, but I couldn't find this issue discussed explicitly anywhere.

Finally, I want to note that this is purely academic; I understand why it is generally good practice to write one's code in this way, and will continue to do so regardless of joblib.

Joe asked Apr 09 '15 17:04


2 Answers

This is necessary because Windows doesn't have fork(). Because of this limitation, Windows needs to re-import your __main__ module in all the child processes it spawns, in order to re-create the parent's state in the child. This means that if you have the code that spawns the new process at the module-level, it's going to be recursively executed in all the child processes. The if __name__ == "__main__" guard is used to prevent code at the module scope from being re-executed in the child processes.

This isn't necessary on Linux because it does have fork(), which allows it to fork a child process that maintains the same state of the parent, without re-importing the __main__ module.
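The re-import behaviour described above can be imitated without starting any processes. The following is only a minimal sketch: it assumes that a spawned child re-runs the main script under the module name `__mp_main__` (a CPython multiprocessing implementation detail, not something stated in the joblib docs), and `SCRIPT`, `events`, and `run_as` are hypothetical names made up for the illustration.

```python
import runpy
import tempfile
import textwrap
from pathlib import Path

# A tiny stand-in for "your script": one unguarded statement, one guarded.
SCRIPT = textwrap.dedent('''
    events = []
    events.append("module level")        # runs every time the file is executed
    if __name__ == "__main__":
        events.append("guarded block")   # runs only when the file is the real main
''')


def run_as(name):
    """Execute SCRIPT as if it were a module called `name`; return what ran."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "script.py"
        path.write_text(SCRIPT)
        namespace = runpy.run_path(str(path), run_name=name)
    return namespace["events"]


# The parent process runs the script as __main__.
parent_events = run_as("__main__")
# Assumption: a spawned child re-runs the same file as __mp_main__.
child_events = run_as("__mp_main__")

print("parent:", parent_events)
print("child: ", child_events)
```

Anything at module level runs in both cases, while the guarded block runs only in the parent. If the unguarded statement were a call to joblib.Parallel, every child would try to start its own children, which is the recursive spawning the warning refers to.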

dano answered Jan 01 '23 20:01


In case someone stumbles across this in 2021: because joblib has used the new "loky" backend by default since version 0.12, protecting the main loop is no longer required. See https://joblib.readthedocs.io/en/latest/parallel.html

burtphil answered Jan 01 '23 20:01