I am summing the elements of a 1D array using either Cython or NumPy. When summing integers, Cython is ~20% faster; when summing floats, Cython is ~2.5x slower. Below are the two simple functions used.
#cython: boundscheck=False
#cython: wraparound=False
cimport numpy as np
from numpy cimport ndarray

def sum_int(ndarray[np.int64_t] a):
    cdef:
        Py_ssize_t i, n = len(a)
        np.int64_t total = 0
    for i in range(n):
        total += a[i]
    return total

def sum_float(ndarray[np.float64_t] a):
    cdef:
        Py_ssize_t i, n = len(a)
        np.float64_t total = 0
    for i in range(n):
        total += a[i]
    return total
Create two arrays of 1 million elements each:
import numpy as np

a_int = np.random.randint(0, 100, 10**6)
a_float = np.random.rand(10**6)
%timeit sum_int(a_int)
394 µs ± 30 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit a_int.sum()
490 µs ± 34.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit sum_float(a_float)
982 µs ± 10.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit a_float.sum()
383 µs ± 4.42 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
The performance of sum_float is the same with the boundscheck and wraparound directives missing. Why?

Converting the array in sum_int to a C pointer (np.int64_t *arr = <np.int64_t *> a.data) improves performance by an additional 25%; doing the same for the floats does nothing.

How can I get the same performance in Cython with floats that I do with integers?
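The pointer version of sum_int looks roughly like this (a sketch, assuming the array is C-contiguous; arr is just an illustrative name):

def sum_int_ptr(ndarray[np.int64_t] a):
    cdef:
        Py_ssize_t i, n = len(a)
        np.int64_t total = 0
        # assumes a C-contiguous buffer; a.data is the array's raw char* pointer
        np.int64_t *arr = <np.int64_t *> a.data
    for i in range(n):
        total += arr[i]
    return total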
I wrote an even simpler function that just counts the number of iterations. The first stores the count as an int, the latter as a double.
def count_int():
    cdef:
        Py_ssize_t i, n = 1000000
        int ct = 0
    for i in range(n):
        ct += 1
    return ct

def count_double():
    cdef:
        Py_ssize_t i, n = 1000000
        double ct = 0
    for i in range(n):
        ct += 1
    return ct
I ran these just once (afraid of caching). I have no idea whether the loop is actually being executed in the integer version, but count_double has the same performance as sum_float from above. This is crazy...
%timeit -n 1 -r 1 count_int()
1.1 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
%timeit -n 1 -r 1 count_double()
971 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
I'm not going to answer all of your questions, only (in my eyes) the most interesting ones.

Let's start with your counting example: the C compiler can replace the integer loop by its result (10**6, which is known at compile time), but it does not do the same for the double version. It would have to prove that the repeated floating-point additions yield exactly 1.0*10**6, and because Cython compiles in IEEE 754 (not -ffast-math) mode per default, it simply executes the loop.

You must keep this in mind when looking at your Cython code: the compiler is not allowed to rearrange the summations (IEEE 754), and because the result of each addition is needed for the next one, there is only one long dependency chain in which all operations wait.
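A quick way to see why the compiler is not allowed to rearrange floating-point additions: they are not associative, so a different order can give a different result, e.g. in plain Python:

>>> (0.1 + 0.2) + 0.3
0.6000000000000001
>>> 0.1 + (0.2 + 0.3)
0.6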
But the most crucial insight: numpy is not doing the same as your cython code:
>>> sum_float(a_float)-a_float.sum()
2.9103830456733704e-08
Yep, nobody told numpy (unlike your cython code) that the sum has to be calculated like this:

((((a_1+a_2)+a_3)+a_4)+...
And numpy takes advantage of this in two ways:

1. It performs pairwise summation (kind of), which leads to a smaller rounding error (a small sketch of the idea follows this list).
2. It calculates the sum in chunks (numpy's code is somewhat hard to understand; here is the corresponding template, and further down a listing of the used function pairwise_sum_DOUBLE).
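To give an idea of the first point, here is a minimal Python sketch of plain pairwise summation (just the idea, not numpy's actual implementation, which additionally unrolls the base case over 8 accumulators, as the listing below shows):

def pairwise(a, lo, hi):
    # split the range in half and sum each half recursively;
    # the rounding error grows like O(log n) instead of O(n)
    if hi - lo <= 8:
        total = 0.0
        for i in range(lo, hi):
            total += a[i]
        return total
    mid = lo + (hi - lo) // 2
    return pairwise(a, lo, mid) + pairwise(a, mid, hi)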
The second point is the reason for the speed-up you are observing: the calculation happens similarly to the following schema (at least as far as I understand from the source code below):
a1 + a9 + ..... = r1
a2 + a10 + ..... = r2
..
a8 + a16 + ..... = r8

----> sum = r1 + .... + r8
The advantage of this kind of summation: the result of a2+a10 doesn't depend on a1+a9, and both values can be calculated simultaneously (e.g. via pipelining) on modern CPUs, which leads to the speed-up you are observing.
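If you want to break this dependency chain in Cython yourself without -ffast-math, you can unroll the loop over several accumulators by hand. A sketch (my own variant with an arbitrary choice of 4 accumulators; like numpy, it changes the summation order, so the result may differ in the last bits from the strictly sequential sum):

def sum_float_unrolled(np.float64_t[::1] a):
    cdef:
        Py_ssize_t i, n = len(a)
        np.float64_t t0 = 0, t1 = 0, t2 = 0, t3 = 0
    # four independent accumulators -> four independent dependency chains,
    # which the CPU can work on simultaneously
    for i in range(0, n - n % 4, 4):
        t0 += a[i]
        t1 += a[i + 1]
        t2 += a[i + 2]
        t3 += a[i + 3]
    # handle the remaining 0-3 elements
    for i in range(n - n % 4, n):
        t0 += a[i]
    return (t0 + t1) + (t2 + t3)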
For what it is worth, on my machine the cython integer sum is slower than numpy's. The need to take the stride of the numpy array into account (it is only known at run time; see also this question about vectorization) prevents some optimizations. A workaround is to use memory views, with which you can make it clear that the data is contiguous, i.e.:
def sum_int_cont(np.int64_t[::1] a):
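The full variant is the same as sum_int apart from the signature:

def sum_int_cont(np.int64_t[::1] a):
    cdef:
        Py_ssize_t i, n = len(a)
        np.int64_t total = 0
    for i in range(n):
        total += a[i]
    return total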
On my machine this leads to a significant speed-up (factor 2):
%timeit sum_int(a_int)
2.64 ms ± 46.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit sum_int_cont(a_int)
1.31 ms ± 19 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit a_int.sum()
2.1 ms ± 105 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
It's true that in this case using memory views for the doubles doesn't bring any speed-up (I don't know why), but in general it simplifies the optimizer's job. For example, combining the memory-view variant with the -ffast-math compile option, which allows the compiler to treat the additions as associative, leads to performance comparable with numpy's:
%%cython -c=-ffast-math
cimport numpy as np

def sum_float_cont(np.float64_t[::1] a):
    cdef:
        Py_ssize_t i, n = len(a)
        np.float64_t total = 0
    for i in range(n):
        total += a[i]
    return total
And now:
>>> %timeit sum_float(a_float)
3.46 ms ± 226 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit sum_float_cont(a_float)
1.87 ms ± 44 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit a_float.sum()
1.41 ms ± 88.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
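Outside of a notebook, the same flag can be passed through the extension's compile arguments, for example in a setup.py (a sketch; the module and file names are just placeholders):

# setup.py -- illustrative sketch; "fastsum" is a placeholder module name
from setuptools import setup
from setuptools.extension import Extension
from Cython.Build import cythonize
import numpy as np

ext = Extension(
    "fastsum",
    sources=["fastsum.pyx"],
    include_dirs=[np.get_include()],
    extra_compile_args=["-ffast-math"],  # lets the compiler reassociate the additions
)

setup(ext_modules=cythonize([ext]))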
Listing of pairwise_sum_DOUBLE:
/*
 * Pairwise summation, rounding error O(lg n) instead of O(n).
 * The recursion depth is O(lg n) as well.
 * when updating also update similar complex floats summation
 */
static npy_double
pairwise_sum_DOUBLE(npy_double *a, npy_uintp n, npy_intp stride)
{
    if (n < 8) {
        npy_intp i;
        npy_double res = 0.;
        for (i = 0; i < n; i++) {
            res += (a[i * stride]);
        }
        return res;
    }
    else if (n <= PW_BLOCKSIZE) {
        npy_intp i;
        npy_double r[8], res;

        /*
         * sum a block with 8 accumulators
         * 8 times unroll reduces blocksize to 16 and allows vectorization with
         * avx without changing summation ordering
         */
        r[0] = (a[0 * stride]);
        r[1] = (a[1 * stride]);
        r[2] = (a[2 * stride]);
        r[3] = (a[3 * stride]);
        r[4] = (a[4 * stride]);
        r[5] = (a[5 * stride]);
        r[6] = (a[6 * stride]);
        r[7] = (a[7 * stride]);

        for (i = 8; i < n - (n % 8); i += 8) {
            r[0] += (a[(i + 0) * stride]);
            r[1] += (a[(i + 1) * stride]);
            r[2] += (a[(i + 2) * stride]);
            r[3] += (a[(i + 3) * stride]);
            r[4] += (a[(i + 4) * stride]);
            r[5] += (a[(i + 5) * stride]);
            r[6] += (a[(i + 6) * stride]);
            r[7] += (a[(i + 7) * stride]);
        }

        /* accumulate now to avoid stack spills for single peel loop */
        res = ((r[0] + r[1]) + (r[2] + r[3])) +
              ((r[4] + r[5]) + (r[6] + r[7]));

        /* do non multiple of 8 rest */
        for (; i < n; i++) {
            res += (a[i * stride]);
        }
        return res;
    }
    else {
        /* divide by two but avoid non-multiples of unroll factor */
        npy_uintp n2 = n / 2;
        n2 -= n2 % 8;
        return pairwise_sum_DOUBLE(a, n2, stride) +
               pairwise_sum_DOUBLE(a + n2 * stride, n - n2, stride);
    }
}