Demystifying sharedctypes performance

Question

In python it is possible to share ctypes objects between multiple processes. However I notice that allocating these objects seems to be extremely expensive.

Consider following code:

from multiprocessing import sharedctypes as sct
import ctypes as ct
import numpy as np

n = 100000
l = np.random.randint(0, 10, size=n)

def foo1():
    sh = sct.RawArray(ct.c_int, l)
    return sh

def foo2():
    sh = sct.RawArray(ct.c_int, len(l))
    sh[:] = l
    return sh

%timeit foo1()
%timeit foo2()

sh1 = foo1()
sh2 = foo2()

for i in range(n):
    assert sh1[i] == sh2[i]

The output is:

10 loops, best of 3: 30.4 ms per loop
100 loops, best of 3: 9.65 ms per loop

There are two things that puzzle me:

Why is explicit allocation and initialization compared to passing a numpy array so much faster?
Why is allocating shared memory in python so extremely expensive? %timeit np.arange(n) only takes 46.4 µs. There are several orders of magnitude between those timings.

Thomas Orozco · Accepted Answer

Sample Code

I rewrote your sample code a little bit to look into this issue. Here's where I landed, I'll use it in my answer below:

so.py:

from multiprocessing import sharedctypes as sct
import ctypes as ct
import numpy as np

n = 100000
l = np.random.randint(0, 10, size=n)


def sct_init():
    sh = sct.RawArray(ct.c_int, l)
    return sh

def sct_subscript():
    sh = sct.RawArray(ct.c_int, n)
    sh[:] = l
    return sh

def ct_init():
    sh = (ct.c_int * n)(*l)
    return sh

def ct_subscript():
    sh = (ct.c_int * n)(n)
    sh[:] = l
    return sh

Note that I added two test cases that do not use shared memory (and use regular a ctypes array instead).

timer.py:

import traceback
from timeit import timeit

for t in ["sct_init", "sct_subscript", "ct_init", "ct_subscript"]:
    print(t)
    try:
        print(timeit("{0}()".format(t), setup="from so import {0}".format(t), number=100))
    except Exception as e:
        print("Failed:", e)
        traceback.print_exc()
    print

print()

print ("Test",)
from so import *
sh1 = sct_init()
sh2 = sct_subscript()

for i in range(n):
    assert sh1[i] == sh2[i]
print("OK")

Test results

The results from running the above code using Python 3.6a0 (specifically 3c2fbdb) are:

sct_init
2.844902500975877
sct_subscript
0.9383537038229406
ct_init
2.7903486443683505
ct_subscript
0.978101353161037

Test
OK

What's interesting is that if you change n, the results scale linearly. For example, using n = 100000 (10 times bigger), you get something that's pretty much 10 times slower:

sct_init
30.57974253082648
sct_subscript
9.48625904135406
ct_init
30.509132395964116
ct_subscript
9.465419146697968

Test
OK

Speed difference

In the end, the speed difference lies in the hot loop that is called to initialize the array by copying every single value over from the Numpy array (l) to the new array (sh). This makes sense, because as we noted speed scales linearly with array size.

When you pass the Numpy array as a constructor argument, the function that does this is Array_init. However, if you assign using sh[:] = l, then it's Array_ass_subscript that does the job.

Again, what matters here are the hot loops. Let's look at them.

Array_init hot loop (slower):

for (i = 0; i < n; ++i) {
    PyObject *v;
    v = PyTuple_GET_ITEM(args, i);
    if (-1 == PySequence_SetItem((PyObject *)self, i, v))
        return -1;
}

Array_ass_subscript hot loop (faster):

for (cur = start, i = 0; i < otherlen; cur += step, i++) {
    PyObject *item = PySequence_GetItem(value, i);
    int result;
    if (item == NULL)
        return -1;
    result = Array_ass_item(myself, cur, item);
    Py_DECREF(item);
    if (result == -1)
        return -1;
}

As it turns out, the majority of the speed difference lies in using PySequence_SetItem vs. Array_ass_item.

Indeed, if you change the code for Array_init to use Array_ass_item instead of PySequence_SetItem (if (-1 == Array_ass_item((PyObject *)self, i, v))), and recompile Python, the new results become:

sct_init
11.504781467840075
sct_subscript
9.381130554247648
ct_init
11.625461496878415
ct_subscript
9.265848568174988

Test
OK

Still a bit slower, but not by much.

In other words, most of the overhead is caused by a slower hot loop, and mostly caused by the code that PySequence_SetItem wraps around Array_ass_item.

This code might appear like little overhead at first read, but it really isn't.

PySequence_SetItem actually calls into the entire Python machinery to resolve the __setitem__ method and call it.

This eventually resolves in a call to Array_ass_item, but only after a large number of levels of indirection (which a direct call to Array_ass_item would bypass entirely!)

Going through the rabbit hole, the call sequence looks a bit like this:

s->ob_type->tp_as_sequence->sq_ass_item points to slot_sq_ass_item.
slot_sq_ass_item calls into call_method.
call_method calls into PyObject_Call
And on and on until we eventually get to Array_ass_item..!

In other words, we have C code in Array_init that's calling Python code (__setitem__) in a hot loop. That's slow.

Why ?

Now, why does Python use PySequence_SetItem in Array_init and not Array_ass_item in Array_init?

That's because if it did, it would be bypassing the hooks that are exposed to the developer in Python-land.

Indeed, you can intercept calls to sh[:] = ... by subclassing the array and overriding __setitem__ (__setslice__ in Python 2). It will be called once, with a slice argument for the index.

Likewise, defining your own __setitem__ also overrides the logic in the constructor. It will be called N times, with an integer argument for the index.

This means that if Array_init directly called into Array_ass_item, then you would lose something: __setitem__ would no longer be called in the constructor, and you wouldn't be able to override the behavior anymore.

Now can we try to retain the faster speed all the while still exposing the same Python hooks?

Well, perhaps, using this code in Array_init instead of the existing hot loop:

 return PySequence_SetSlice((PyObject*)self, 0, PyTuple_GET_SIZE(args), args);

Using this will call into __setitem__ once with a slice argument (on Python 2, it would call into __setslice__). We still go through the Python hooks, but we only do it once instead of N times.

Using this code, the performance becomes:

sct_init
12.24651838419959
sct_subscript
10.984305887017399
ct_init
12.138383641839027
ct_subscript
11.79078131634742

Test
OK

Other overhead

I think the rest of the overhead may be due to the tuple instantiation that takes place when calling __init__ on the array object (note the *, and the fact that Array_init expects a tuple for args) — this presumably scales with n as well.

Indeed, if you replace sh[:] = l with sh[:] = tuple(l) in the test case, then the performance results become almost identical. With n = 100000:

sct_init
11.538272527977824
sct_subscript
10.985187001060694
ct_init
11.485244687646627
ct_subscript
10.843198659364134

Test
OK

There's probably still something smaller going on, but ultimately we're comparing two substantially different hot loops. There's simply little reason to expect them to have identical performance.

I think it might be interesting to try calling Array_ass_subscript from Array_init for the hot loop and see the results, though!

Baseline speed

Now, to your second question, regarding allocating shared memory.

Note that there isn't really a cost to allocating shared memory. As noted in the results above, there isn't a substantial difference between using shared memory or not.

Looking at the Numpy code (np.arange is implemented here), we can finally understand why it's so much faster than sct.RawArray: np.arange doesn't appear to make calls to Python "user-land" (i.e. no call to PySequence_GetItem or PySequence_SetItem).

That doesn't necessarily explain all the difference, but you'd probably want to start investigating there.

ShadowRanger · Answer

Not an answer (the accepted answer explains this quite well), but for those looking for how to fix this, here's how: Don't use RawArrays slice assignment operator.

As noted in the accepted answer, RawArrays slice assignment operator doesn't take advantage of the fact that you're copying between two wrappers around C-style arrays of identical type and size. But RawArray implements the buffer protocol, so you can wrap it in a memoryview to access it in an "even more raw" way (and it will make Foo2 win, because you can only do this after constructing the object, not as part of construction):

def foo2():
    sh = sct.RawArray(ct.c_int, len(l))
    # l must be another buffer protocol object w/the same C format, which is the case here
    memoryview(sh)[:] = l
    return sh

In tests solving this problem on another question, the time to copy using a memoryview wrapper is less than 1% of the time required to copy with RawArrays normal slice assignment. One trick here is that the sizes of the elements of the output of np.random.randint are np.int, and on a 64 bit system, np.int is 64 bits, so on 64 bit Python, you need another round of copying to coerce it to the right size (or you need to declare the RawArray to be of a type that matches the size of np.int). Even if you do need to make that temporary copy though, it's still much cheaper with a memoryview:

>>> l = np.random.randint(0, 10, size=100000)
>>> %time sh = sct.RawArray(ct.c_int, len(l))
Wall time: 472 µs  # Creation is cheap

>>> %time sh[:] = l
Wall time: 14.4 ms  # TOO LONG!

# Must convert to numpy array with matching element size when c_int and np.int don't match
>>> %time memoryview(sh)[:] = np.array(l, dtype=np.int32)
Wall time: 424 µs

As you can see, even when you need to copy the np.array to resize the elements first, the total time is less than 3% of the time required using RawArray's own slice assignment operator.

If you avoid the temporary copy by making the size of the RawArray match the source, the cost drops further:

# Make it 64 bit to match size of np.int on my machine
>>> %time sh = sct.RawArray(ct.c_int64, len(l))
Wall time: 522 µs  # Creation still cheap, even at double the size

# No need to convert source array now:
>>> %time memoryview(sh)[:] = l
Wall time: 123 µs

which gets us down to 0.85% of the RawArray slice assignment time; at this point, you're basically running at memcpy speeds; the rest of your actual Python code will swamp the miniscule amount of time spent on data copying.

nicolas · Answer

This should be a comment, but I do not have enough reputation :-(

Starting with Python 3.5, shared arrays in Linux are created as temp files mapped to memory (see https://bugs.python.org/issue30919). I think this explains why creating a Numpy array, which is created in memory, is faster than creating and initializing a large shared array. To force Python to use shared memory, a workaround is to execute these two lines of code (ref. No space left while using Multiprocessing.Array in shared memory):

from multiprocessing.process import current_process
current_process()._config['tempdir'] = '/dev/shm'

Demystifying sharedctypes performance

Tags:

python

cel

3 Answers

Sample Code

Test results

Speed difference

Why ?

Other overhead

Baseline speed

Thomas Orozco

ShadowRanger

nicolas

Recent Activity

Donate For Us

Demystifying sharedctypes performance

Tags:

python

cel

3 Answers

Sample Code

Test results

Speed difference

Why ?

Other overhead

Baseline speed

Thomas Orozco

ShadowRanger

nicolas

Related questions

Recent Activity

Donate For Us