I watched Brandon Rhodes' talk about Cython - "The Day of the EXE Is Upon Us".
Brandon mentions at 09:30 that for a specific short piece of code, skipping interpretation gave a 40% speedup, while skipping the allocation and dispatch gave a 574% speedup (10:10).
My question is: how is this measured for a specific piece of code? Does one need to manually extract the underlying C calls and then somehow make the runtime run them?
This is a very interesting observation, but how do I recreate the experiment?
Let's take a look at this python function:
def py_fun(i,N,step):
    res=0.0
    while i<N:
        res+=i
        i+=step
    return res
and use IPython's %timeit magic to time it:
In [11]: %timeit py_fun(0.0,1.0e5,1.0)
10 loops, best of 3: 25.4 ms per loop
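To get a feeling for what the interpreter has to work through, one can dump the bytecode of py_fun with the standard dis module (the exact opcodes differ between CPython versions, so the output is not reproduced here):
import dis
dis.dis(py_fun)   # prints the loop's bytecode, e.g. LOAD_FAST, COMPARE_OP and an in-place add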
The interpreter runs through the resulting bytecode and interprets it. However, we can cut out the interpreter by cythonizing the very same code:
%load_ext Cython
%%cython
def cy_fun(i,N,step):
    res=0.0
    while i<N:
        res+=i
        i+=step
    return res
This cuts the running time by more than half:
In [13]: %timeit cy_fun(0.0,1.0e5,1.0)
100 loops, best of 3: 10.9 ms per loop
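To inspect what Cython generated, the cell magic accepts an --annotate flag which (in a notebook) shows the Python source interleaved with the generated C; alternatively, cython -a module.pyx writes an annotated HTML file next to the .c file:
%%cython --annotate
def cy_fun(i,N,step):
    res=0.0
    while i<N:
        res+=i
        i+=step
    return res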
When we look into the produced C code, we see that the right functions are called directly, without the need to be interpreted / to go through ceval. Here it is after stripping away the boilerplate code:
static PyObject *__pyx_pf_4test_cy_fun(CYTHON_UNUSED PyObject *__pyx_self, PyObject *__pyx_v_i, PyObject *__pyx_v_N, PyObject *__pyx_v_step) {
  ...
  while (1) {
    __pyx_t_1 = PyObject_RichCompare(__pyx_v_i, __pyx_v_N, Py_LT);
    ...
    __pyx_t_2 = __Pyx_PyObject_IsTrue(__pyx_t_1);
    ...
    if (!__pyx_t_2) break;
    ...
    __pyx_t_1 = PyNumber_InPlaceAdd(__pyx_v_res, __pyx_v_i);
    ...
    __pyx_t_1 = PyNumber_InPlaceAdd(__pyx_v_i, __pyx_v_step);
  }
  ...
  return __pyx_r;
}
However, this Cython function handles Python objects and not C-style floats, so inside PyNumber_InPlaceAdd it is still necessary to figure out what these objects really are (integer, float, something else?) and to dispatch the call to the right function that does the job.
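This dispatch is precisely what keeps the function generic. As a quick illustration (assuming the cy_fun compiled above), the very same compiled code accepts any type that implements < and +=:
from fractions import Fraction

print(cy_fun(0.0, 1.0e5, 1.0))                         # floats, as timed above
print(cy_fun(0, 100000, 1))                            # Python ints work as well
print(cy_fun(Fraction(0), Fraction(10), Fraction(1)))  # and so do Fractions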
With the help of Cython we can also eliminate the need for this dispatch and call the addition for doubles directly:
%%cython
def c_fun(double i,double N, double step):
    cdef double res=0.0
    while i<N:
        res+=i
        i+=step
    return res
In this version, i, N, step and res are C-style doubles and no longer Python objects. So there is no longer any need to call dispatch functions like PyNumber_InPlaceAdd; instead, the +-operator for double can be used directly:
static PyObject *__pyx_pf_4test_c_fun(CYTHON_UNUSED PyObject *__pyx_self, double __pyx_v_i, double __pyx_v_N, double __pyx_v_step) {
  ...
  __pyx_v_res = 0.0;
  ...
  while (1) {
    __pyx_t_1 = ((__pyx_v_i < __pyx_v_N) != 0);
    if (!__pyx_t_1) break;
    __pyx_v_res = (__pyx_v_res + __pyx_v_i);
    __pyx_v_i = (__pyx_v_i + __pyx_v_step);
  }
  ...
  return __pyx_r;
}
And the result is:
In [15]: %timeit c_fun(0.0,1.0e5,1.0)
10000 loops, best of 3: 148 µs per loop
Now this is a speed-up of roughly a factor of 70 compared to the version without the interpreter but with dispatch (10.9 ms vs. 148 µs).
Actually, to say that dispatch+allocation is the bottleneck here (because eliminating it caused such a large speed-up) would be a fallacy: the interpreter is responsible for more than 50% of the running time (about 15 ms), while dispatch and allocation account "only" for about 10 ms.
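A back-of-the-envelope breakdown of the timings above (the attribution is only approximate, the effects are not perfectly additive):
t_py, t_cy, t_c = 25.4, 10.9, 0.148   # ms, measured above

print(t_py - t_cy)   # ~14.5 ms: interpreting the bytecode
print(t_cy - t_c)    # ~10.8 ms: dynamic dispatch + allocation of temporary objects
print(t_c)           # ~0.15 ms: the bare C loop on doubles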
However, there are more problems for performance than the interpreter and the dynamic dispatch: float is immutable, so every time its value changes a new object must be created and registered/unregistered in the garbage collector.
We can introduce mutable floats, which are changed in place and don't need registering/unregistering:
%%cython
cdef class MutableFloat:
    cdef double x
    def __cinit__(self, x):
        self.x=x
    def __iadd__(self, MutableFloat other):
        self.x=self.x+other.x
        return self
    def __lt__(MutableFloat self, MutableFloat other):
        return self.x<other.x
    def __gt__(MutableFloat self, MutableFloat other):
        return self.x>other.x
    def __repr__(self):
        return str(self.x)
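A minimal usage sketch (assuming the MutableFloat class compiled above) shows the in-place behavior:
a = MutableFloat(1.0)
b = a                     # b refers to the very same object as a
a += MutableFloat(2.0)    # __iadd__ changes the object in place, no new object is created
print(a, b)               # 3.0 3.0 -- which is also why the arguments must be
                          # reinitialized before every timing run below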
The timings (I now use a different machine, so the timings are a little bit different):
def py_fun(i,N,step,acc):
    while i<N:
        acc+=i
        i+=step
    return acc
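The cy_fun timed below is the cythonized counterpart of this four-argument version; its cell isn't shown here, so this is a minimal sketch of what it looks like:
%%cython
def cy_fun(i,N,step,acc):
    while i<N:
        acc+=i
        i+=step
    return acc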
%timeit py_fun(1.0, 5e5, 1.0, 0.0)
30.2 ms ± 1.12 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit cy_fun(1.0, 5e5, 1.0, 0.0)
16.9 ms ± 612 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit i,N,step,acc=MutableFloat(1.0),MutableFloat(5e5),MutableFloat(1.0),MutableFloat(0.0); py_fun(i,N,step,acc)
23 ms ± 254 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit i,N,step,acc=MutableFloat(1.0),MutableFloat(5e5),MutableFloat(1.0),MutableFloat(0.0); cy_fun(i,N,step,acc)
11 ms ± 66.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Don't forget to reinitialize i, because it is mutable! The results:
         immutable   mutable
py_fun     30ms       23ms
cy_fun     17ms       11ms
So up to 7 ms (about 20%) are needed for registering/unregistering floats (I'm not sure something else isn't also playing a role) in the version with the interpreter, and more than 33% in the version without the interpreter.
As it looks now, interpretation, dispatch+allocation and the handling of temporary objects each account for a sizeable share of the running time.
Another problem is the locality of the data, which becomes obvious for memory-bandwidth-bound problems: modern caches work well if the data is processed linearly, one consecutive memory address after another. This is true when looping over a std::vector<> (or an array.array), but not when looping over a Python list, because such a list consists of pointers which can point to any place in memory.
Consider the following python scripts:
#list.py
N=int(1e7)
lst=[0]*int(N)
for i in range(N):
    lst[i]=i
print(sum(lst))
and
#bytearray.py
N=int(1e7)
b=bytearray(8*N)
m=memoryview(b).cast('L') #reinterpret as an array of unsigned longs
for i in range(N):
    m[i]=i
print(sum(m))
They both create 1e7 integers: the first version Python integers, the second lowly C ints, which are placed contiguously in memory.
The interesting part is how many data-cache (D1) misses these scripts produce:
valgrind --tool=cachegrind python list.py
...
D1 misses: 33,964,276 ( 27,473,138 rd + 6,491,138 wr)
versus
valgrind --tool=cachegrind python bytearray.py
...
D1 misses: 4,796,626 ( 2,140,357 rd + 2,656,269 wr)
That means 8 times more cache misses for the Python integers. Part of this is due to the fact that a Python integer needs more than 8 bytes of memory (probably 32 bytes, i.e. a factor of 4), and part (maybe, not 100% sure, because neighboring integers are created one after another, so chances are high they end up stored next to each other somewhere in memory; further investigation needed) is due to the fact that they aren't laid out contiguously in memory the way the C integers of the bytearray are.
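The size overhead is easy to check (exact values depend on the CPython version and build; a small int typically reports 28 bytes on a 64-bit CPython 3):
import sys
m = memoryview(bytearray(8)).cast('L')
print(sys.getsizeof(10**6))   # one Python int object, e.g. 28 bytes on a 64-bit build
print(m.itemsize)             # one entry of the cast memoryview: 8 bytes here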