Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Cython to accelerate loops with array broadcast

Summary:

You guys are too awesome... I got my real code working. I took JoshAdel's advice, namely the following:

1) Changed all ndarray to typed memoryviews 2) Unrolled all the numpy array calculations manually 3) Used statically defined unsigned int for the index 4) Disabled the boundscheck and wraparound

And also, thanks a lot to Veedrac's insight!

Original post:

I know that python do these kind of code really slow:

import numpy as np

def func0():
    x = 0.
    for i in range(1000):
        x += 1.
    return

And if I change this to Cython, it can be much faster:

import numpy as np
cimport numpy as np

def func0():
    cdef double x = 0.
    for i in range(1000):
        x += 1.
    return

And here is the result:

# Python
10000 loops, best of 3: 58.9 µs per loop
# Cython
10000000 loops, best of 3: 66.8 ns per loop

However, now I have this kind of code, where it is not loops of single number, but loops of arrays. (Yes... I am solving a PDE so this happens).

I know the following example is stupid, but just try to get the idea of the type of calculation:

Pure python:

def func1():
    array1 = np.random.rand(50000, 4)
    array2 = np.random.rand(50000)
    for i in range(1000):
        array1[:, 0] += array2
        array1[:, 1] += array2
        array1[:, 2] += array2
        array1[:, 3] += array2
    return

Cython:

def func1():
    cdef np.ndarray[np.double_t, ndim=2] array1 = np.random.rand(50000, 4)
    cdef np.ndarray[np.double_t, ndim=1] array2 = np.random.rand(50000)
    for i in range(1000):
        array1[:, 0] += array2
        array1[:, 1] += array2
        array1[:, 2] += array2
        array1[:, 3] += array2
    return

And there is no improvement almost at all. In the meantime, I do know that Python is not good at handling these huge loops due to big overhead.

# Python
1 loops, best of 3: 299 ms per loop
# Cython
1 loops, best of 3: 300 ms per loop

Any suggestions in how I should improve these kind of code?

like image 700
Yuxiang Wang Avatar asked Mar 03 '26 17:03

Yuxiang Wang


1 Answers

In these two other implementations, I've played around with

  • Using cython compiler directives to remove some of the checks that numpy usually has to make
  • Use typed memoryviews so that I can specify memory layout (and sometimes they are faster in general compared to the older buffer interface)
  • Unrolled the loops so that we don't use numpy's slice machinary:

Otherwise, you are just using numpy via cython, and numpy already implements this in fairly fast c code under the hood.

The methods:

import numpy as np
cimport numpy as np

cimport cython

def func1():
    cdef np.ndarray[np.double_t, ndim=2] array1 = np.random.rand(50000, 4)
    cdef np.ndarray[np.double_t, ndim=1] array2 = np.random.rand(50000)
    cdef unsigned int i
    for i in range(1000):
        array1[:, 0] += array2
        array1[:, 1] += array2
        array1[:, 2] += array2
        array1[:, 3] += array2
    return

@cython.boundscheck(False)
@cython.wraparound(False)
def func2():
    cdef np.ndarray[np.double_t, ndim=2] array1 = np.random.rand(50000, 4)
    cdef np.ndarray[np.double_t, ndim=1] array2 = np.random.rand(50000)
    cdef unsigned int i, k
    for i in range(1000):
        for k in xrange(50000):
            array1[k, 0] += array2[k]
            array1[k, 1] += array2[k]
            array1[k, 2] += array2[k]
            array1[k, 3] += array2[k]
    return


@cython.boundscheck(False)
@cython.wraparound(False)
def func3():
    cdef np.ndarray[np.double_t, ndim=2] array1 = np.random.rand(50000, 4)
    cdef np.ndarray[np.double_t, ndim=1] array2 = np.random.rand(50000)
    cdef np.double_t[::1] a2 = array2
    cdef np.double_t[:,::1] a1 = array1
    cdef unsigned int i, k

    for i in range(1000):
        for k in xrange(50000):
            a1[k, 0] += a2[k]
            a1[k, 1] += a2[k]
            a1[k, 2] += a2[k]
            a1[k, 3] += a2[k]
    return

The timings (on my machine - compilers and hardware may of course effect the timings):

  • Pure numpy: 1 loops, best of 3: 419 ms per loop
  • Your original cython function with typing i: 1 loops, best of 3: 428 ms per loop
  • func2: 1 loops, best of 3: 336 ms per loop
  • func3: 1 loops, best of 3: 206 ms per loop
like image 192
JoshAdel Avatar answered Mar 05 '26 07:03

JoshAdel



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!