
Cython inline function with numpy array as parameter

Consider code like this:

import numpy as np
cimport numpy as np

cdef inline inc(np.ndarray[np.int32_t] arr, int i):
    arr[i]+= 1

def test1(np.ndarray[np.int32_t] arr):
    cdef int i
    for i in xrange(len(arr)):
        inc(arr, i)

def test2(np.ndarray[np.int32_t] arr):
    cdef int i
    for i in xrange(len(arr)):
        arr[i] += 1

I used ipython to measure speed of test1 and test2:

In [7]: timeit ttt.test1(arr)
100 loops, best of 3: 6.13 ms per loop

In [8]: timeit ttt.test2(arr)
100000 loops, best of 3: 9.79 us per loop

Is there a way to optimize test1? Why doesn't Cython inline this function as instructed?

UPDATE: Actually, what I need is multidimensional code like this:

# cython: infer_types=True
# cython: boundscheck=False
# cython: wraparound=False

import numpy as np
cimport numpy as np

cdef inline inc(np.ndarray[np.int32_t, ndim=2] arr, int i, int j):
    arr[i, j] += 1

def test1(np.ndarray[np.int32_t, ndim=2] arr):
    cdef int i,j    
    for i in xrange(arr.shape[0]):
        for j in xrange(arr.shape[1]):
            inc(arr, i, j)


def test2(np.ndarray[np.int32_t, ndim=2] arr):    
    cdef int i,j    
    for i in xrange(arr.shape[0]):
        for j in xrange(arr.shape[1]):
            arr[i,j] += 1   

Timing for it:

In [7]: timeit ttt.test1(arr)
1 loops, best of 3: 647 ms per loop

In [8]: timeit ttt.test2(arr)
100 loops, best of 3: 2.07 ms per loop

Explicit inlining gives a 300x speedup. My real function is quite big, though, so inlining it by hand makes the code much harder to maintain.

UPDATE2:

# cython: infer_types=True
# cython: boundscheck=False
# cython: wraparound=False

import numpy as np
cimport numpy as np

cdef inline inc(np.ndarray[np.float32_t, ndim=2] arr, int i, int j):
  arr[i, j]+= 1

def test1(np.ndarray[np.float32_t, ndim=2] arr):
    cdef int i,j    
    for i in xrange(arr.shape[0]):
        for j in xrange(arr.shape[1]):
            inc(arr, i, j)


def test2(np.ndarray[np.float32_t, ndim=2] arr):    
    cdef int i,j    
    for i in xrange(arr.shape[0]):
        for j in xrange(arr.shape[1]):
            arr[i,j] += 1    

cdef class FastPassingFloat2DArray(object):
    cdef float* data
    cdef int stride0, stride1 
    def __init__(self, np.ndarray[np.float32_t, ndim=2] arr):
        self.data = <float*>arr.data
        # Convert byte strides to element strides; // keeps the result an integer
        self.stride0 = arr.strides[0] // arr.dtype.itemsize
        self.stride1 = arr.strides[1] // arr.dtype.itemsize
    def __getitem__(self, tuple tp):
        cdef int i, j
        cdef float *pr, r
        i, j = tp        
        pr = (self.data + self.stride0*i + self.stride1*j)
        r = pr[0]
        return r
    def __setitem__(self, tuple tp, float value):
        cdef int i, j
        cdef float *pr, r
        i, j = tp        
        pr = (self.data + self.stride0*i + self.stride1*j)
        pr[0] = value        


cdef inline inc2(FastPassingFloat2DArray arr, int i, int j):
    arr[i, j]+= 1


def test3(np.ndarray[np.float32_t, ndim=2] arr):    
    cdef int i,j    
    cdef FastPassingFloat2DArray tmparr = FastPassingFloat2DArray(arr)
    for i in xrange(arr.shape[0]):
        for j in xrange(arr.shape[1]):
            inc2(tmparr, i,j)

Timings:

In [4]: timeit ttt.test1(arr)
1 loops, best of 3: 623 ms per loop

In [5]: timeit ttt.test2(arr)
100 loops, best of 3: 2.29 ms per loop

In [6]: timeit ttt.test3(arr)
1 loops, best of 3: 201 ms per loop
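
For readers following the pointer arithmetic in FastPassingFloat2DArray, here is a minimal pure-Python sketch of the same idea: byte strides divided by the item size give element strides, and element (i, j) sits at offset stride0*i + stride1*j in the flat buffer. The names `flat`, `byte_strides` and `inc_flat` are made up for illustration, and the sketch assumes a C-contiguous (row-major) layout.

```python
from array import array

rows, cols = 3, 4

# Flat buffer of C floats standing in for the array's raw data pointer.
flat = array("f", [0.0] * (rows * cols))
itemsize = flat.itemsize

# Byte strides of a C-contiguous rows x cols array (assumed layout).
byte_strides = (cols * itemsize, itemsize)

# Element strides: byte strides divided by the item size, as in __init__.
stride0 = byte_strides[0] // itemsize
stride1 = byte_strides[1] // itemsize


def inc_flat(buf, i, j):
    """Increment element (i, j) through the flat offset."""
    buf[stride0 * i + stride1 * j] += 1


inc_flat(flat, 1, 2)
```

With this layout the element strides come out as (cols, 1), so inc_flat(flat, 1, 2) touches flat[1 * cols + 2].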
Maxim asked Jan 09 '11 19:01


1 Answer

More than 3 years have passed since the question was posted, and there has been great progress in the meantime. On this code (the 2D int32 version from the question's first update):

# cython: infer_types=True
# cython: boundscheck=False
# cython: wraparound=False
import numpy as np
cimport numpy as np

cdef inline inc(np.ndarray[np.int32_t, ndim=2] arr, int i, int j):
    arr[i, j]+= 1

def test1(np.ndarray[np.int32_t, ndim=2] arr):
    cdef int i,j    
    for i in xrange(arr.shape[0]):
        for j in xrange(arr.shape[1]):
            inc(arr, i, j)

def test2(np.ndarray[np.int32_t, ndim=2] arr):    
    cdef int i,j    
    for i in xrange(arr.shape[0]):
        for j in xrange(arr.shape[1]):
            arr[i,j] += 1

I get the following timings:

arr = np.zeros((1000,1000), dtype=np.int32)
%timeit test1(arr)
%timeit test2(arr)
   1 loops, best of 3: 354 ms per loop
1000 loops, best of 3: 1.02 ms per loop

So the problem is still reproducible after more than 3 years. Cython now has typed memoryviews; AFAIK they were introduced in Cython 0.16, so they were not available when the question was posted. With this:

# cython: infer_types=True
# cython: boundscheck=False
# cython: wraparound=False
import numpy as np
cimport numpy as np

cdef inline inc(int[:, ::1] tmv, int i, int j):
    tmv[i, j]+= 1

def test3(np.ndarray[np.int32_t, ndim=2] arr):
    cdef int i,j
    cdef int[:, ::1] tmv = arr
    for i in xrange(tmv.shape[0]):
        for j in xrange(tmv.shape[1]):
            inc(tmv, i, j)

def test4(np.ndarray[np.int32_t, ndim=2] arr):    
    cdef int i,j
    cdef int[:, ::1] tmv = arr
    for i in xrange(tmv.shape[0]):
        for j in xrange(tmv.shape[1]):
            tmv[i,j] += 1

With this I get:

arr = np.zeros((1000,1000), dtype=np.int32)
%timeit test3(arr)
%timeit test4(arr)
1000 loops, best of 3: 977 µs per loop
1000 loops, best of 3: 838 µs per loop
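
As a side note on what the int[:, ::1] declaration asks for: the ::1 on the last axis says that axis is contiguous, i.e. a C-contiguous 2D buffer. The standard library's memoryview is not the same thing as a Cython typed memoryview, but it exposes the same buffer properties, so here is a rough sketch of that layout (the names `view`, `inc` and `bump_all` are just for illustration):

```python
import struct

rows, cols = 3, 4
itemsize = struct.calcsize("i")  # size of a C int on this platform

# Zero-filled raw bytes viewed as a 2D C-contiguous array of C ints.
buf = bytearray(rows * cols * itemsize)
view = memoryview(buf).cast("i", shape=[rows, cols])


def inc(v, i, j):
    """Increment one element, mirroring the inc() helper above."""
    v[i, j] += 1


def bump_all(v):
    """Visit every element through the helper, like test3 does."""
    n_rows, n_cols = v.shape
    for i in range(n_rows):
        for j in range(n_cols):
            inc(v, i, j)


bump_all(view)
```

Here view.strides is (cols * itemsize, itemsize): the last stride equals the item size, which is the property that ::1 promises to Cython.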

We are almost there and already faster than the old-fashioned way! Now, the inc() function is eligible to be declared nogil, so let's declare it so! But oops:

Error compiling Cython file:
[...]

cdef inline inc(int[:, ::1] tmv, int i, int j) nogil:
    ^
[...]
Function with Python return type cannot be declared nogil

Aaah, I totally missed that the void return type was missing! Once again but now with void:

cdef inline void inc(int[:, ::1] tmv, int i, int j) nogil:
    tmv[i, j]+= 1

And finally I get:

%timeit test3(arr)
%timeit test4(arr)
1000 loops, best of 3: 843 µs per loop
1000 loops, best of 3: 853 µs per loop

As fast as manual inlining!


Now, just for fun, I tried Numba on this code:

import numpy as np
from numba import autojit, jit

@autojit
def inc(arr, i, j):
    arr[i, j] += 1

@autojit
def test5(arr):
    for i in xrange(arr.shape[0]):
        for j in xrange(arr.shape[1]):
            inc(arr, i, j)

I get:

arr = np.zeros((1000,1000), dtype=np.int32)
%timeit test5(arr)
100 loops, best of 3: 4.03 ms per loop

Even though it's 4.7x slower than Cython, most likely because the JIT compiler failed to inline inc(), I think it is AWESOME! All I needed to do was add @autojit, and I didn't have to clutter the code with clumsy type declarations; an 88x speedup for next to nothing!
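
For anyone checking the ratios, this sketch recomputes them from the timings quoted in this answer. Which pairs of timings were divided is my assumption (Numba's test5 against the manually inlined test4, and the ndarray-based test1 against test5); the variable names are made up.

```python
# Timings quoted above, converted to milliseconds.
cython_test1_ms = 354.0   # ndarray argument passed to inc()
cython_test4_ms = 0.853   # typed memoryview, manually inlined
numba_test5_ms = 4.03     # Numba version

# Assumed baseline for "slower than Cython": the manually inlined test4.
numba_vs_cython = numba_test5_ms / cython_test4_ms

# Assumed baseline for the speedup: the slow ndarray-based test1.
speedup_over_test1 = cython_test1_ms / numba_test5_ms

print(f"Numba vs Cython: {numba_vs_cython:.1f}x slower")
print(f"Numba vs test1:  {speedup_over_test1:.0f}x faster")
```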

I have tried other things with Numba, such as

@jit('void(i4[:,:],i4,i4)')
def inc(arr, i, j):
    arr[i, j] += 1

or nopython=True, but neither improved it any further.

Improving inlining is on the Numba developers' list; we only need to file more requests to give it higher priority. ;)

Ali answered Oct 01 '22 16:10