I am summing the elements of a 1D array using either Cython or NumPy. When summing integers, Cython is ~20% faster; when summing floats, Cython is ~2.5x slower. Below are the two simple functions used.
#cython: boundscheck=False
#cython: wraparound=False
cimport numpy as np
from numpy cimport ndarray

def sum_int(ndarray[np.int64_t] a):
    cdef:
        Py_ssize_t i, n = len(a)
        np.int64_t total = 0
    for i in range(n):
        total += a[i]
    return total

def sum_float(ndarray[np.float64_t] a):
    cdef:
        Py_ssize_t i, n = len(a)
        np.float64_t total = 0
    for i in range(n):
        total += a[i]
    return total
Create two arrays of 1 million elements each:
import numpy as np

a_int = np.random.randint(0, 100, 10**6)
a_float = np.random.rand(10**6)
%timeit sum_int(a_int)
394 µs ± 30 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit a_int.sum()
490 µs ± 34.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit sum_float(a_float)
982 µs ± 10.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit a_float.sum()
383 µs ± 4.42 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
The performance of sum_float is the same with the boundscheck and wraparound directives missing. Why?

Converting the array in sum_int to a C pointer (np.int64_t *arr = <np.int64_t *> a.data) improves performance by an additional 25%; doing the same for the floats does nothing.

How can I get the same performance in Cython with floats that I do with integers?
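The pointer version of sum_int looks roughly like this (a sketch, assuming the array is C-contiguous; arr is just an illustrative name):

def sum_int_ptr(ndarray[np.int64_t] a):
    cdef:
        Py_ssize_t i, n = len(a)
        np.int64_t total = 0
        # assumes a C-contiguous buffer; a.data is the array's raw char* pointer
        np.int64_t *arr = <np.int64_t *> a.data
    for i in range(n):
        total += arr[i]
    return total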
I wrote an even simpler function that just counts the number of iterations. The first stores the count as an int, the latter as a double.
def count_int():
    cdef:
        Py_ssize_t i, n = 1000000
        int ct = 0
    for i in range(n):
        ct += 1
    return ct

def count_double():
    cdef:
        Py_ssize_t i, n = 1000000
        double ct = 0
    for i in range(n):
        ct += 1
    return ct
I ran these just once (afraid of caching). I have no idea whether the loop is actually being executed in the integer version, but count_double has the same performance as sum_float from above. This is crazy...
%timeit -n 1 -r 1 count_int()
1.1 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
%timeit -n 1 -r 1 count_double()
971 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
I'm not going to answer all of your questions, only (in my eyes) the most interesting ones.

Let's start with your counting example: the C compiler can replace the integer loop by its result (10**6, which is known at compile time), but it does not do the same for the double version. It would have to prove that the repeated floating-point additions yield exactly 1.0*10**6, and because Cython compiles in IEEE 754 (not -ffast-math) mode per default, it simply executes the loop.

You must keep this in mind when looking at your Cython code: the compiler is not allowed to rearrange the summations (IEEE 754), and because the result of each addition is needed for the next one, there is only one long dependency chain in which all operations wait.
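A quick way to see why the compiler is not allowed to rearrange floating-point additions: they are not associative, so a different order can give a different result, e.g. in plain Python:

>>> (0.1 + 0.2) + 0.3
0.6000000000000001
>>> 0.1 + (0.2 + 0.3)
0.6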
But the most crucial insight: numpy is not doing the same as your cython code:
>>> sum_float(a_float)-a_float.sum()
2.9103830456733704e-08
Yep, nobody told numpy (unlike your cython code) that the sum has to be calculated like this:

((((a_1+a_2)+a_3)+a_4)+...
And numpy takes advantage of this in two ways:

1. It performs pairwise summation (kind of), which leads to a smaller rounding error (a small sketch of the idea follows this list).
2. It calculates the sum in chunks (numpy's code is somewhat hard to understand; here is the corresponding template, and further down a listing of the used function pairwise_sum_DOUBLE).
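To give an idea of the first point, here is a minimal Python sketch of plain pairwise summation (just the idea, not numpy's actual implementation, which additionally unrolls the base case over 8 accumulators, as the listing below shows):

def pairwise(a, lo, hi):
    # split the range in half and sum each half recursively;
    # the rounding error grows like O(log n) instead of O(n)
    if hi - lo <= 8:
        total = 0.0
        for i in range(lo, hi):
            total += a[i]
        return total
    mid = lo + (hi - lo) // 2
    return pairwise(a, lo, mid) + pairwise(a, mid, hi)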
The second point is the reason for the speed-up you are observing: the calculation happens similarly to the following schema (at least as far as I understand from the source code below):
a1 + a9 + ..... = r1
a2 + a10 + ..... = r2
..
a8 + a16 + ..... = r8

----> sum = r1 + .... + r8
The advantage of this kind of summation: the result of a2+a10 doesn't depend on a1+a9, and both values can be calculated simultaneously (e.g. via pipelining) on modern CPUs, which leads to the speed-up you are observing.
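If you want to break this dependency chain in Cython yourself without -ffast-math, you can unroll the loop over several accumulators by hand. A sketch (my own variant with an arbitrary choice of 4 accumulators; like numpy, it changes the summation order, so the result may differ in the last bits from the strictly sequential sum):

def sum_float_unrolled(np.float64_t[::1] a):
    cdef:
        Py_ssize_t i, n = len(a)
        np.float64_t t0 = 0, t1 = 0, t2 = 0, t3 = 0
    # four independent accumulators -> four independent dependency chains,
    # which the CPU can work on simultaneously
    for i in range(0, n - n % 4, 4):
        t0 += a[i]
        t1 += a[i + 1]
        t2 += a[i + 2]
        t3 += a[i + 3]
    # handle the remaining 0-3 elements
    for i in range(n - n % 4, n):
        t0 += a[i]
    return (t0 + t1) + (t2 + t3)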
For what it is worth, on my machine the cython integer sum is slower than numpy's. The need to take the stride of the numpy array into account (it is only known at run time; see also this question about vectorization) prevents some optimizations. A workaround is to use memory views, with which you can make it clear that the data is contiguous, i.e.:
def sum_int_cont(np.int64_t[::1] a):
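The full variant is the same as sum_int apart from the signature:

def sum_int_cont(np.int64_t[::1] a):
    cdef:
        Py_ssize_t i, n = len(a)
        np.int64_t total = 0
    for i in range(n):
        total += a[i]
    return total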
On my machine this leads to a significant speed-up (factor 2):
%timeit sum_int(a_int)
2.64 ms ± 46.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit sum_int_cont(a_int)
1.31 ms ± 19 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit a_int.sum()
2.1 ms ± 105 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
It's true that in this case using memory views for the doubles doesn't bring any speed-up (I don't know why), but in general it simplifies the optimizer's job. For example, combining the memory-view variant with the -ffast-math compile option, which allows the compiler to treat the additions as associative, leads to performance comparable with numpy's:
%%cython -c=-ffast-math
cimport numpy as np

def sum_float_cont(np.float64_t[::1] a):
    cdef:
        Py_ssize_t i, n = len(a)
        np.float64_t total = 0
    for i in range(n):
        total += a[i]
    return total
And now:
>>> %timeit sum_float(a_float)
3.46 ms ± 226 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit sum_float_cont(a_float)
1.87 ms ± 44 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit a_float.sum()
1.41 ms ± 88.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
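Outside of a notebook, the same flag can be passed through the extension's compile arguments, for example in a setup.py (a sketch; the module and file names are just placeholders):

# setup.py -- illustrative sketch; "fastsum" is a placeholder module name
from setuptools import setup
from setuptools.extension import Extension
from Cython.Build import cythonize
import numpy as np

ext = Extension(
    "fastsum",
    sources=["fastsum.pyx"],
    include_dirs=[np.get_include()],
    extra_compile_args=["-ffast-math"],  # lets the compiler reassociate the additions
)

setup(ext_modules=cythonize([ext]))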
Listing of pairwise_sum_DOUBLE:
/*
 * Pairwise summation, rounding error O(lg n) instead of O(n).
 * The recursion depth is O(lg n) as well.
 * when updating also update similar complex floats summation
 */
static npy_double
pairwise_sum_DOUBLE(npy_double *a, npy_uintp n, npy_intp stride)
{
    if (n < 8) {
        npy_intp i;
        npy_double res = 0.;
        for (i = 0; i < n; i++) {
            res += (a[i * stride]);
        }
        return res;
    }
    else if (n <= PW_BLOCKSIZE) {
        npy_intp i;
        npy_double r[8], res;

        /*
         * sum a block with 8 accumulators
         * 8 times unroll reduces blocksize to 16 and allows vectorization with
         * avx without changing summation ordering
         */
        r[0] = (a[0 * stride]);
        r[1] = (a[1 * stride]);
        r[2] = (a[2 * stride]);
        r[3] = (a[3 * stride]);
        r[4] = (a[4 * stride]);
        r[5] = (a[5 * stride]);
        r[6] = (a[6 * stride]);
        r[7] = (a[7 * stride]);

        for (i = 8; i < n - (n % 8); i += 8) {
            r[0] += (a[(i + 0) * stride]);
            r[1] += (a[(i + 1) * stride]);
            r[2] += (a[(i + 2) * stride]);
            r[3] += (a[(i + 3) * stride]);
            r[4] += (a[(i + 4) * stride]);
            r[5] += (a[(i + 5) * stride]);
            r[6] += (a[(i + 6) * stride]);
            r[7] += (a[(i + 7) * stride]);
        }

        /* accumulate now to avoid stack spills for single peel loop */
        res = ((r[0] + r[1]) + (r[2] + r[3])) +
              ((r[4] + r[5]) + (r[6] + r[7]));

        /* do non multiple of 8 rest */
        for (; i < n; i++) {
            res += (a[i * stride]);
        }
        return res;
    }
    else {
        /* divide by two but avoid non-multiples of unroll factor */
        npy_uintp n2 = n / 2;
        n2 -= n2 % 8;
        return pairwise_sum_DOUBLE(a, n2, stride) +
               pairwise_sum_DOUBLE(a + n2 * stride, n - n2, stride);
    }
}