I have been trying to learn more about Python's multiprocessing
module and to evaluate different techniques for communication between processes. I wrote a benchmark that compares the performance of Pipe
, Queue
, and Array
(all from multiprocessing
) for transferring numpy
arrays between processes. The full benchmark can be found here. Here's a snippet of the test for Queue
:
def process_with_queue(input_queue, output_queue):
source = input_queue.get()
dest = source**2
output_queue.put(dest)
def test_with_queue(size):
source = np.random.random(size)
input_queue = Queue()
output_queue = Queue()
p = Process(target=process_with_queue, args=(input_queue, output_queue))
start = timer()
p.start()
input_queue.put(source)
result = output_queue.get()
end = timer()
np.testing.assert_allclose(source**2, result)
return end - start
I ran this test on my Linux laptop and got the following results for an array size of 1000000:
Using mp.Array: time for 20 iters: total=2.4869s, avg=0.12435s
Using mp.Queue: time for 20 iters: total=0.6583s, avg=0.032915s
Using mp.Pipe: time for 20 iters: total=0.63691s, avg=0.031845s
I was a little surprised to see Array
perform so poorly since it uses shared memory and presumably doesn't require pickling, but I assume there must be some copying in numpy
that I can't control.
However, I ran the same test (again for array size 1000000) on a Macbook, and got the following results:
Using mp.Array: time for 20 iters: total=1.6917s, avg=0.084587s
Using mp.Queue: time for 20 iters: total=2.3478s, avg=0.11739s
Using mp.Pipe: time for 20 iters: total=8.7709s, avg=0.43855s
The real timing differences aren't that surprising since of course different systems would exhibit different performance. What is so surprising are the differences in relative timing.
What could account for this? This is a pretty surprising result to me. I wouldn't be surprised to see such stark differences between Linux and Windows, or OSX and Windows, but I sort of assumed that these things would behave very similarly between OSX and Linux.
This question addresses performance differences between Windows and OSX, which seems more expected.
Using Array
from multiprocessing
uses the C types Python library to make a C call to set memory for the Array. This takes relatively more time on Linux than on OSX. You can also observe this on OSX by using pypy. Setting memory takes much longer using pypy (and GCC and LLVM) than using python3 on OSX (using Clang).
The major difference is in the implementation of multiprocessing
, which works different under OSX than in Windows. The most important difference is the way multiprocessing
starts a new process. There are three ways this can be done: using spawn
, fork
or forkserver
. The default (and only supported) way under Windows is spawn
. The default way under *nix (including OSX) is fork
. This is documented in the Contexts and start methods section of the multiprocessing
documentation.
One other reason for the deviation in results is the low number of iterations you take.
If you increase the number of iterations and calculate the number of handled function calls per time unit, you get relatively consistent results between the three methods.
I removed your timeit
timer functions and wrapped your code in the cProfile
profiler.
I added this wrapper function:
def run_test(iters, size, func):
for _ in range(iters):
func(size)
And I replaced the loop in main()
with:
for func in [test_with_array, test_with_pipe, test_with_queue]:
print(f"*** Running {func.__name__} ***")
pr = cProfile.Profile()
pr.enable()
run_test(args.iters, args.size, func)
pr.disable()
ps = pstats.Stats(pr, stream=sys.stdout)
ps.strip_dirs().sort_stats('cumtime').print_stats()
What I see is that Queue is faster than Pipe, which is faster than Array. Regardsless of the platform (OSX/Linux/Windows), Queue is between 2 and 3 times faster than Pipe. On OSX and Windows, Pipe is around 1.2 and 1.5 times faster than Array. But on Linux, Pipe is around 3.6 times faster than Array. In other words, On Linux, Array is relatively much slower than on Windows and OSX. This is strange.
Using the cProfile data, I compared the performance ratio between OSX and Linux. There are two function calls that take a lot of time: Array
and RawArray
in sharedctypes.py
. These functions are only called in the Array scenario (not in Pipe or Queue). On Linux, these calls take almost 70% of the time, while on OSX only 42% of the time. So this a major factor.
If we zoom in to the code, we see that Array
(line 84) calls RawArray
, and RawArray
(line 54) does nothing special, except a call to ctypes.memset
(documentation). So there we have a suspect. Let's test it.
The following code uses timeit to test the performance of setting 1 MB of memory buffer to 'A'.
import timeit
cmds = """\
import ctypes
s=ctypes.create_string_buffer(1024*1024)
ctypes.memset(ctypes.addressof(s), 65, ctypes.sizeof(s))"""
timeit.timeit(cmds, number=100000)
Running this on my MacBookPro and on my Linux server confirms the behaviour that this runs much slower on Linux than on OSX. Knowing that pypy is on OSX compiled using GCC and Apples LLVM, this is more akin to the Linux world than Python, which is on OSX compiled directly against Clang. Normally, Python programs runs faster on pypy than on CPython, but the code above runs 6.4 times slower on pypy (on the same hardware!).
My knowlegde of C toolchains and C libraries is limited, so I can't dig deeper. So my conclusion is: OSX and Windows are faster with Array because memory calls to the C library slow Array down on Linux.
Next I ran this on my dual-boot MacBook Pro under OSX and under Windows. The advantage is that the underlying hardware is the same; only the OS is different. I increased the number of iterations to 1000 and the size to 10.000.
The results are as follows:
We can see that:
spawn
) takes more calls than OSX (using fork
);What's not immediately evident, but relevant to note is that if you look at the average time per call, the relative pattern between the three multiprocessing methodes (Array, Queue and Pipe) is the same (see graphs below). In other words: the differences in performance between Array, Queue and Pipe in OSX and Windows can be completely explained by two factors: 1. the difference in Python performance between the two platforms; 2. the different ways both platforms handle multiprocessing.
In other words: the difference in the number of calls is explained by the Contexts and start methods section of the multiprocessing
documentation. The difference in execution time is explained in the performance difference of Python between OSX and Windows. If you factor out those two components, the relative performance of Array, Queue and Pipe are (more or less) comparable on OSX and Windows, as is shown in the graphs below.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With