Performance discrepancy between OSX and Linux for communication using Python multiprocessing

Question

I have been trying to learn more about Python's multiprocessing module and to evaluate different techniques for communication between processes. I wrote a benchmark that compares the performance of Pipe, Queue, and Array (all from multiprocessing) for transferring numpy arrays between processes. The full benchmark can be found here. Here's a snippet of the test for Queue:

def process_with_queue(input_queue, output_queue):
    source = input_queue.get()
    dest = source**2
    output_queue.put(dest)


def test_with_queue(size):

    source = np.random.random(size)

    input_queue = Queue()
    output_queue = Queue()

    p = Process(target=process_with_queue, args=(input_queue, output_queue))
    start = timer()
    p.start()
    input_queue.put(source)
    result = output_queue.get()
    end = timer()

    np.testing.assert_allclose(source**2, result)

    return end - start

I ran this test on my Linux laptop and got the following results for an array size of 1000000:

Using mp.Array: time for 20 iters: total=2.4869s, avg=0.12435s
Using mp.Queue: time for 20 iters: total=0.6583s, avg=0.032915s
Using mp.Pipe:  time for 20 iters: total=0.63691s, avg=0.031845s

I was a little surprised to see Array perform so poorly since it uses shared memory and presumably doesn't require pickling, but I assume there must be some copying in numpy that I can't control.

However, I ran the same test (again for array size 1000000) on a Macbook, and got the following results:

Using mp.Array: time for 20 iters: total=1.6917s, avg=0.084587s
Using mp.Queue: time for 20 iters: total=2.3478s, avg=0.11739s
Using mp.Pipe:  time for 20 iters: total=8.7709s, avg=0.43855s

The real timing differences aren't that surprising since of course different systems would exhibit different performance. What is so surprising are the differences in relative timing.

What could account for this? This is a pretty surprising result to me. I wouldn't be surprised to see such stark differences between Linux and Windows, or OSX and Windows, but I sort of assumed that these things would behave very similarly between OSX and Linux.

This question addresses performance differences between Windows and OSX, which seems more expected.

agtoever · Accepted Answer

TL;DR: OSX is faster with Array because calls to the C library slow Array down on Linux

Using Array from multiprocessing uses the C types Python library to make a C call to set memory for the Array. This takes relatively more time on Linux than on OSX. You can also observe this on OSX by using pypy. Setting memory takes much longer using pypy (and GCC and LLVM) than using python3 on OSX (using Clang).

TL;DR: the difference between Windows and OSX lies in the way multiprocessing starts new processes

The major difference is in the implementation of multiprocessing, which works different under OSX than in Windows. The most important difference is the way multiprocessing starts a new process. There are three ways this can be done: using spawn, fork or forkserver. The default (and only supported) way under Windows is spawn. The default way under *nix (including OSX) is fork. This is documented in the Contexts and start methods section of the multiprocessing documentation.

One other reason for the deviation in results is the low number of iterations you take.

If you increase the number of iterations and calculate the number of handled function calls per time unit, you get relatively consistent results between the three methods.

Further analysis: look at the function calls with cProfile

I removed your timeit timer functions and wrapped your code in the cProfile profiler.

I added this wrapper function:

def run_test(iters, size, func):
    for _ in range(iters):
        func(size)

And I replaced the loop in main() with:

for func in [test_with_array, test_with_pipe, test_with_queue]:
    print(f"*** Running {func.__name__} ***")
    pr = cProfile.Profile()
    pr.enable()
    run_test(args.iters, args.size, func)
    pr.disable()
    ps = pstats.Stats(pr, stream=sys.stdout)
    ps.strip_dirs().sort_stats('cumtime').print_stats()

Analysis of the OSX - Linux difference with Array

What I see is that Queue is faster than Pipe, which is faster than Array. Regardsless of the platform (OSX/Linux/Windows), Queue is between 2 and 3 times faster than Pipe. On OSX and Windows, Pipe is around 1.2 and 1.5 times faster than Array. But on Linux, Pipe is around 3.6 times faster than Array. In other words, On Linux, Array is relatively much slower than on Windows and OSX. This is strange.

Using the cProfile data, I compared the performance ratio between OSX and Linux. There are two function calls that take a lot of time: Array and RawArray in sharedctypes.py. These functions are only called in the Array scenario (not in Pipe or Queue). On Linux, these calls take almost 70% of the time, while on OSX only 42% of the time. So this a major factor.

If we zoom in to the code, we see that Array (line 84) calls RawArray, and RawArray (line 54) does nothing special, except a call to ctypes.memset (documentation). So there we have a suspect. Let's test it.

The following code uses timeit to test the performance of setting 1 MB of memory buffer to 'A'.

import timeit
cmds = """\
import ctypes
s=ctypes.create_string_buffer(1024*1024)
ctypes.memset(ctypes.addressof(s), 65, ctypes.sizeof(s))"""
timeit.timeit(cmds, number=100000)

Running this on my MacBookPro and on my Linux server confirms the behaviour that this runs much slower on Linux than on OSX. Knowing that pypy is on OSX compiled using GCC and Apples LLVM, this is more akin to the Linux world than Python, which is on OSX compiled directly against Clang. Normally, Python programs runs faster on pypy than on CPython, but the code above runs 6.4 times slower on pypy (on the same hardware!).

My knowlegde of C toolchains and C libraries is limited, so I can't dig deeper. So my conclusion is: OSX and Windows are faster with Array because memory calls to the C library slow Array down on Linux.

Analysis of the OSX - Windows performance difference

Next I ran this on my dual-boot MacBook Pro under OSX and under Windows. The advantage is that the underlying hardware is the same; only the OS is different. I increased the number of iterations to 1000 and the size to 10.000.

The results are as follows:

OSX:
- Array: 225668 calls in 10.895 seconds
- Pipe: 209552 calls in 6.894 seconds
- Queue: 728173 calls in 7.892 seconds
Windows:
- Array: 354076 calls in 296.050 seconds
- Pipe: 374229 calls in 234.996 seconds
- Queue: 903705 calls in 250.966 seconds

We can see that:

The Windows implementation (using spawn) takes more calls than OSX (using fork);
The Windows implementation takes much more time per call than OSX.

What's not immediately evident, but relevant to note is that if you look at the average time per call, the relative pattern between the three multiprocessing methodes (Array, Queue and Pipe) is the same (see graphs below). In other words: the differences in performance between Array, Queue and Pipe in OSX and Windows can be completely explained by two factors: 1. the difference in Python performance between the two platforms; 2. the different ways both platforms handle multiprocessing.

In other words: the difference in the number of calls is explained by the Contexts and start methods section of the multiprocessing documentation. The difference in execution time is explained in the performance difference of Python between OSX and Windows. If you factor out those two components, the relative performance of Array, Queue and Pipe are (more or less) comparable on OSX and Windows, as is shown in the graphs below.

Performance differences of Array, Queue and Pipe between OSX and Windows

Performance discrepancy between OSX and Linux for communication using Python multiprocessing

Tags:

python

linux

macos

multiprocessing

communication

ddavella

1 Answers

TL;DR: OSX is faster with Array because calls to the C library slow Array down on Linux

TL;DR: the difference between Windows and OSX lies in the way multiprocessing starts new processes

Further analysis: look at the function calls with cProfile

Analysis of the OSX - Linux difference with Array

Analysis of the OSX - Windows performance difference

agtoever

Recent Activity

Donate For Us

Performance discrepancy between OSX and Linux for communication using Python multiprocessing

Tags:

python

linux

macos

multiprocessing

communication

ddavella

1 Answers

TL;DR: OSX is faster with Array because calls to the C library slow Array down on Linux

TL;DR: the difference between Windows and OSX lies in the way multiprocessing starts new processes

Further analysis: look at the function calls with cProfile

Analysis of the OSX - Linux difference with Array

Analysis of the OSX - Windows performance difference

agtoever

Related questions

Recent Activity

Donate For Us