I would like to do a series of operations on particular elements of matrices. I need to define the indices of these elements in an external object (self.indices in the example below).
Here is a stupid example of implementation in cython :
%%cython -f -c=-O2 -I./ 
import numpy as np
cimport numpy as np
cimport cython
cdef class Test:
    
    cdef double[:, ::1] a, b
    cdef Py_ssize_t[:, ::1] indices
    
    def __cinit__(self, a, b, indices):
        self.a = a
        self.b = b
        self.indices = indices
    
    @cython.boundscheck(False)
    @cython.nonecheck(False)
    @cython.wraparound(False)
    @cython.initializedcheck(False)
    cpdef void run1(self):
        """ Use of external structure of indices. """
        cdef Py_ssize_t idx, ix, iy
        cdef int n = self.indices.shape[0]
        
        
        for idx in range(n):
            ix = self.indices[idx, 0]
            iy = self.indices[idx, 1]
            self.b[ix, iy] = ix * iy * self.a[ix, iy]
    @cython.boundscheck(False)
    @cython.nonecheck(False)
    @cython.wraparound(False)
    @cython.initializedcheck(False)
    cpdef void run2(self):
        """ Direct formulation """
        cdef Py_ssize_t idx, ix, iy
        cdef int nx = self.a.shape[0]
        cdef int ny = self.a.shape[1]
        
        for ix in range(nx):
            for iy in range(ny):
                self.b[ix, iy] = ix * iy * self.a[ix, iy]
with this on the python side :
import itertools
import numpy as np
N = 256
a = np.random.rand(N, N)
b = np.zeros_like(a)
indices = np.array([[i, j] for i, j in itertools.product(range(N), range(N))], dtype=int)
test = Test(a, b, indices)
and the results :
%timeit test.run1()
75.6 µs ± 1.51 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
%timeit test.run2()
41.4 µs ± 1.77 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
Why does the Test.run1() method run much slower than the Test.run2() method?
What are the possibilities to keep a similar level of performance as with Test.run2() by using an external list, array, or any other kind of structure of indices?
Because run1 is significantly more complicated...
run1 is having to read from two separate bits of memory which almost certainly makes the CPU cache less efficient.run2. In contrast in run1 it could be accessing them in any order. That likely allows for significant optimizations.Your current performance is probably as good as it gets.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With