Multiprocessing slower than serial processing in Windows (but not in Linux)

I'm trying to parallelise a for loop to speed up my code, since the loop's processing operations are all independent. Following online tutorials, the standard multiprocessing library in Python seemed a good starting point, and I've got it working for basic examples.

However, for my actual use case, I find that parallel processing (on a dual-core machine) is actually slightly (<5%) slower when run on Windows. Running the same code on Linux, by contrast, gives a parallel speed-up of ~25% compared to serial execution.

From the docs, I believe this may relate to Windows' lack of a fork() function, which means each process needs to be initialised from scratch. However, I don't fully understand this, and wonder if anyone can confirm it?

In particular:

--> Does this mean that all the code in the calling Python file gets run for each parallel process on Windows, including initialising classes and importing packages?

--> If so, can this be avoided by somehow passing a copy of the class (e.g. using deepcopy) into the new processes?

--> Are there any tips / other strategies for designing code that parallelises efficiently on both UNIX and Windows?

My exact code is long and spread across many files, so I have created a pseudocode-style example structure which hopefully shows the issue.

# Imports
import multiprocessing

import numpy as np

from my_package import MyClass
# ... many other packages / functions are imported here

# Initialisation (instantiate the class and call slow functions that get it ready for processing)
my_class = MyClass()
my_class.set_up(input1=1, input2=2)

# Define the main processing function to be used in the loop
def calculation(_input_data):
    # Perform some functions on _input_data
    # ...
    # Call a method of the instantiated class to act on the data
    return my_class.class_func(_input_data)

input_data = np.linspace(0, 1, 50)
output_data = np.zeros_like(input_data)

# For loop (SERIAL implementation)
for i, x in enumerate(input_data):
    output_data[i] = calculation(x)

# PARALLEL implementation (this doesn't work well!)
if __name__ == '__main__':  # guard needed so spawned children don't re-run this block
    with multiprocessing.Pool(processes=4) as pool:
        results = pool.map_async(calculation, input_data)
        results.wait()
    output_data = results.get()

EDIT: I do not believe this question is a duplicate of the one suggested, since it concerns a difference between Windows and Linux behaviour, which is not mentioned at all in the suggested duplicate.

asked Sep 23 '18 by IanRoberts

1 Answer

NT operating systems lack the UNIX fork primitive. When a new process is created, it starts as a blank process, and it is the responsibility of the parent to instruct the new process on how to bootstrap itself.

Python's multiprocessing API abstracts process creation, trying to give the same feel to the fork, forkserver and spawn start methods.

When you use the spawn start method, this is what happens under the hood:

  1. A blank process is created
  2. The blank process starts a brand new Python interpreter
  3. The Python interpreter is given the MFA (Module Function Arguments) you specified via the Process class initializer
  4. The Python interpreter loads the given module, resolving all of its imports
  5. The target function is looked up within the module and called with the given args and kwargs

The above flow has a few implications.

As you noticed yourself, spawning is a much more taxing operation than forking, which is why you see such a difference in performance.
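You can observe this directly by forcing the spawn start method on Linux and timing it against fork. The following is only a minimal sketch, not code from your question; square and the pool size are arbitrary placeholders:

import multiprocessing
import time

def square(x):
    return x * x

def time_pool(start_method):
    # Create a context for the requested start method and time a small map
    ctx = multiprocessing.get_context(start_method)
    start = time.perf_counter()
    with ctx.Pool(processes=4) as pool:
        pool.map(square, range(50))
    return time.perf_counter() - start

if __name__ == '__main__':
    # On Linux both methods are available; spawn pays the interpreter
    # start-up and module re-import cost for every worker process
    print('fork :', time_pool('fork'))
    print('spawn:', time_pool('spawn'))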

As the module gets imported from scratch in the child process, all import side effects are executed anew. This means that constants, global variables, decorators and top-level statements are executed again.
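As a minimal illustration (a hypothetical self-contained script, not your code), a top-level print statement runs again in every spawned worker:

import multiprocessing

# This top-level statement is an import side effect: under spawn it runs
# once in the parent and once more in every child that imports the module
print('module loaded in', multiprocessing.current_process().name)

def work(x):
    return x + 1

if __name__ == '__main__':
    ctx = multiprocessing.get_context('spawn')
    with ctx.Pool(processes=2) as pool:
        pool.map(work, range(4))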

On the other hand, initializations made during the parent process's execution are not propagated to the child, as the example below shows.
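Here is a small sketch of that behaviour (hypothetical names, assuming the spawn start method):

import multiprocessing

VALUE = 'set at import time'

def show(_):
    # Under spawn the child re-imports the module, so it sees the
    # import-time value, not the assignment made in the parent below
    return VALUE

if __name__ == '__main__':
    VALUE = 'reassigned in the parent after import'
    ctx = multiprocessing.get_context('spawn')
    with ctx.Pool(processes=1) as pool:
        print(pool.map(show, [0]))  # prints ['set at import time']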

This is why the multiprocessing documentation includes a paragraph specific to Windows in its Programming Guidelines. I highly recommend reading the Programming Guidelines, as they contain all the information required to write portable multiprocessing code.
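Applied to the structure in your question, one portable pattern is to move the expensive set-up into a Pool initializer so it runs once per worker rather than as an import side effect. This is only a sketch built on your placeholder names (MyClass, set_up, class_func); the initializer argument itself is standard multiprocessing API:

import multiprocessing

import numpy as np

from my_package import MyClass

my_class = None  # set in each worker by init_worker

def init_worker():
    # Runs once per worker process, keeping the expensive set-up
    # out of the module's import-time side effects
    global my_class
    my_class = MyClass()
    my_class.set_up(input1=1, input2=2)

def calculation(_input_data):
    return my_class.class_func(_input_data)

if __name__ == '__main__':
    input_data = np.linspace(0, 1, 50)
    with multiprocessing.Pool(processes=4, initializer=init_worker) as pool:
        output_data = pool.map(calculation, input_data)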

answered Sep 27 '22 by noxdafox