Cython inline function with numpy array as parameter

Consider code like this:

import numpy as np
cimport numpy as np

cdef inline inc(np.ndarray[np.int32_t] arr, int i):
    arr[i]+= 1

def test1(np.ndarray[np.int32_t] arr):
    cdef int i
    for i in xrange(len(arr)):
        inc(arr, i)

def test2(np.ndarray[np.int32_t] arr):
    cdef int i
    for i in xrange(len(arr)):
        arr[i] += 1

I used IPython to measure the speed of test1 and test2:

In [7]: timeit ttt.test1(arr)
100 loops, best of 3: 6.13 ms per loop

In [8]: timeit ttt.test2(arr)
100000 loops, best of 3: 9.79 us per loop

Is there a way to optimize test1? Why doesn't Cython inline this function as told?

UPDATE: Actually, what I need is multidimensional code like this:

# cython: infer_types=True
# cython: boundscheck=False
# cython: wraparound=False

import numpy as np
cimport numpy as np

cdef inline inc(np.ndarray[np.int32_t, ndim=2] arr, int i, int j):
    arr[i, j] += 1

def test1(np.ndarray[np.int32_t, ndim=2] arr):
    cdef int i,j    
    for i in xrange(arr.shape[0]):
        for j in xrange(arr.shape[1]):
            inc(arr, i, j)


def test2(np.ndarray[np.int32_t, ndim=2] arr):    
    cdef int i,j    
    for i in xrange(arr.shape[0]):
        for j in xrange(arr.shape[1]):
            arr[i,j] += 1

Timing for it:

In [7]: timeit ttt.test1(arr)
1 loops, best of 3: 647 ms per loop

In [8]: timeit ttt.test2(arr)
100 loops, best of 3: 2.07 ms per loop

Explicit inlining gives a roughly 300x speedup, but my real function is quite big, so inlining it by hand would make the code much harder to maintain.

UPDATE2:

# cython: infer_types=True
# cython: boundscheck=False
# cython: wraparound=False

import numpy as np
cimport numpy as np

cdef inline inc(np.ndarray[np.float32_t, ndim=2] arr, int i, int j):
  arr[i, j]+= 1

def test1(np.ndarray[np.float32_t, ndim=2] arr):
    cdef int i,j    
    for i in xrange(arr.shape[0]):
        for j in xrange(arr.shape[1]):
            inc(arr, i, j)


def test2(np.ndarray[np.float32_t, ndim=2] arr):    
    cdef int i,j    
    for i in xrange(arr.shape[0]):
        for j in xrange(arr.shape[1]):
            arr[i,j] += 1    

cdef class FastPassingFloat2DArray(object):
    cdef float* data
    cdef int stride0, stride1 
    def __init__(self, np.ndarray[np.float32_t, ndim=2] arr):
        self.data = <float*>arr.data
        # Floor division keeps the element strides integral under Python 3 semantics too.
        self.stride0 = arr.strides[0] // arr.dtype.itemsize
        self.stride1 = arr.strides[1] // arr.dtype.itemsize
    def __getitem__(self, tuple tp):
        cdef int i, j
        cdef float *pr, r
        i, j = tp        
        pr = (self.data + self.stride0*i + self.stride1*j)
        r = pr[0]
        return r
    def __setitem__(self, tuple tp, float value):
        cdef int i, j
        cdef float *pr, r
        i, j = tp        
        pr = (self.data + self.stride0*i + self.stride1*j)
        pr[0] = value        


cdef inline inc2(FastPassingFloat2DArray arr, int i, int j):
    arr[i, j]+= 1


def test3(np.ndarray[np.float32_t, ndim=2] arr):    
    cdef int i,j    
    cdef FastPassingFloat2DArray tmparr = FastPassingFloat2DArray(arr)
    for i in xrange(arr.shape[0]):
        for j in xrange(arr.shape[1]):
            inc2(tmparr, i,j)

Timings:

In [4]: timeit ttt.test1(arr)
1 loops, best of 3: 623 ms per loop

In [5]: timeit ttt.test2(arr)
100 loops, best of 3: 2.29 ms per loop

In [6]: timeit ttt.test3(arr)
1 loops, best of 3: 201 ms per loop
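
The index arithmetic that FastPassingFloat2DArray performs can be sketched in pure Python, which may help explain the test3 timing. This is only an illustration: the class name Strided2D and the use of the standard-library array module are assumptions made for the sketch, not part of the original code.

```python
from array import array


class Strided2D:
    """Hypothetical pure-Python stand-in for FastPassingFloat2DArray."""

    def __init__(self, flat, byte_strides):
        self.flat = flat
        itemsize = flat.itemsize
        # Convert byte strides to element strides, as the Cython class does.
        self.stride0 = byte_strides[0] // itemsize
        self.stride1 = byte_strides[1] // itemsize

    def __getitem__(self, idx):
        i, j = idx
        return self.flat[self.stride0 * i + self.stride1 * j]

    def __setitem__(self, idx, value):
        i, j = idx
        self.flat[self.stride0 * i + self.stride1 * j] = value


rows, cols = 3, 4
flat = array("f", [0.0] * (rows * cols))
# C-contiguous layout: one row spans cols * itemsize bytes.
view = Strided2D(flat, (cols * flat.itemsize, flat.itemsize))
view[1, 2] += 1  # calls __getitem__, then __setitem__
```

Note that `view[1, 2] += 1` goes through both `__getitem__` and `__setitem__`, each receiving a tuple of indices. In the Cython class these are also `def`-style special methods, so every increment presumably still packs a tuple and boxes a float, which would be consistent with test3 being faster than test1 but far slower than test2.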

asked Jan 09 '11 19:01

Maxim

1 Answer

More than 3 years have passed since the question was posted, and there has been great progress in the meantime. On this code (from the first update of the question):

# cython: infer_types=True
# cython: boundscheck=False
# cython: wraparound=False
import numpy as np
cimport numpy as np

cdef inline inc(np.ndarray[np.int32_t, ndim=2] arr, int i, int j):
    arr[i, j]+= 1

def test1(np.ndarray[np.int32_t, ndim=2] arr):
    cdef int i,j    
    for i in xrange(arr.shape[0]):
        for j in xrange(arr.shape[1]):
            inc(arr, i, j)

def test2(np.ndarray[np.int32_t, ndim=2] arr):    
    cdef int i,j    
    for i in xrange(arr.shape[0]):
        for j in xrange(arr.shape[1]):
            arr[i,j] += 1

I get the following timings:

arr = np.zeros((1000,1000), dtype=np.int32)
%timeit test1(arr)
%timeit test2(arr)
   1 loops, best of 3: 354 ms per loop
1000 loops, best of 3: 1.02 ms per loop
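
To put these two timings in per-element terms, here is a small back-of-the-envelope calculation. It uses only the numbers reported above and the 1000 x 1000 array size; the variable names are made up for the sketch.

```python
n_elements = 1000 * 1000   # arr = np.zeros((1000, 1000), dtype=np.int32)
test1_s = 354e-3           # 354 ms: inc() called once per element
test2_s = 1.02e-3          # 1.02 ms: increment written out in the loop

ns_per_call_test1 = test1_s / n_elements * 1e9
ns_per_elem_test2 = test2_s / n_elements * 1e9
ratio = test1_s / test2_s

print(f"test1: {ns_per_call_test1:.0f} ns per call")
print(f"test2: {ns_per_elem_test2:.2f} ns per element")
print(f"ratio: {ratio:.0f}x")
```

So each call to inc() costs roughly 354 ns, against about 1 ns for the bare increment. Nearly all of the time in test1 is call overhead rather than the arithmetic itself.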

So the problem is still reproducible after more than 3 years. However, Cython now has typed memoryviews; AFAIK they were introduced in Cython 0.16, so they were not available when the question was posted. With this:

# cython: infer_types=True
# cython: boundscheck=False
# cython: wraparound=False
import numpy as np
cimport numpy as np

cdef inline inc(int[:, ::1] tmv, int i, int j):
    tmv[i, j]+= 1

def test3(np.ndarray[np.int32_t, ndim=2] arr):
    cdef int i,j
    cdef int[:, ::1] tmv = arr
    for i in xrange(tmv.shape[0]):
        for j in xrange(tmv.shape[1]):
            inc(tmv, i, j)

def test4(np.ndarray[np.int32_t, ndim=2] arr):    
    cdef int i,j
    cdef int[:, ::1] tmv = arr
    for i in xrange(tmv.shape[0]):
        for j in xrange(tmv.shape[1]):
            tmv[i,j] += 1

The timings are now:

arr = np.zeros((1000,1000), dtype=np.int32)
%timeit test3(arr)
%timeit test4(arr)
1000 loops, best of 3: 977 µs per loop
1000 loops, best of 3: 838 µs per loop

We are almost there, and already faster than the old-fashioned way! Now, the inc() function is eligible to be declared nogil, so let's do that. But oops:

Error compiling Cython file:
[...]

cdef inline inc(int[:, ::1] tmv, int i, int j) nogil:
    ^
[...]
Function with Python return type cannot be declared nogil

Aaah, I had completely overlooked the missing void return type! A cdef function without a declared return type returns a Python object. Once again, but now with void:

cdef inline void inc(int[:, ::1] tmv, int i, int j) nogil:
    tmv[i, j]+= 1

And finally I get:

%timeit test3(arr)
%timeit test4(arr)
1000 loops, best of 3: 843 µs per loop
1000 loops, best of 3: 853 µs per loop

As fast as manual inlining!

Now, just for fun, I tried Numba on this code:

import numpy as np
from numba import autojit, jit

@autojit
def inc(arr, i, j):
    arr[i, j] += 1

@autojit
def test5(arr):
    for i in xrange(arr.shape[0]):
        for j in xrange(arr.shape[1]):
            inc(arr, i, j)

I get:

arr = np.zeros((1000,1000), dtype=np.int32)
%timeit test5(arr)
100 loops, best of 3: 4.03 ms per loop

Even though it's 4.7x slower than Cython, most likely because the JIT compiler failed to inline inc(), I think it is AWESOME! All I needed to do was add @autojit; I didn't have to clutter the code with clumsy type declarations. An 88x speedup for next to nothing!
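
The two ratios quoted above can be reproduced from the reported timings. This assumes the baselines are the 853 µs of the manually inlined memoryview version (test4) and the 354 ms of the original test1; the answer does not state them explicitly, so treat that pairing as an assumption.

```python
# Timings reported above, converted to milliseconds.
cython_ndarray_call_ms = 354.0   # test1: inc() taking an np.ndarray buffer argument
cython_memoryview_ms = 0.853     # test4: manual inlining on a typed memoryview
numba_autojit_ms = 4.03          # test5: @autojit on both functions

slowdown_vs_cython = numba_autojit_ms / cython_memoryview_ms
speedup_vs_test1 = cython_ndarray_call_ms / numba_autojit_ms

print(round(slowdown_vs_cython, 1), round(speedup_vs_test1))  # prints: 4.7 88
```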

I have tried other things with Numba, such as

@jit('void(i4[:,:],i4,i4)')  # arr is 2-D, so the signature needs a 2-D array type
def inc(arr, i, j):
    arr[i, j] += 1

or nopython=True, but neither improved the timing any further.

Improving inlining is on the Numba developers' list; we only need to file more requests to give it a higher priority. ;)

answered Oct 01 '22 16:10

Ali
