When iterating over NumPy arrays, Numba seems dramatically faster than Cython.
What Cython optimizations am I possibly missing?
Here is a simple example:
import numpy as np

def f(arr):
    res = np.zeros(len(arr))
    for i in range(len(arr)):
        res[i] = (arr[i])**2
    return res

arr = np.random.rand(10000)
%timeit f(arr)
Out: 4.81 ms ± 72.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%load_ext cython

%%cython
import numpy as np
cimport numpy as np
cimport cython
from libc.math cimport pow

#@cython.boundscheck(False)
#@cython.wraparound(False)
cpdef f(double[:] arr):
    cdef np.ndarray[dtype=np.double_t, ndim=1] res
    res = np.zeros(len(arr), dtype=np.double)
    cdef double[:] res_view = res
    cdef int i
    for i in range(len(arr)):
        res_view[i] = pow(arr[i], 2)
    return res

arr = np.random.rand(10000)
%timeit f(arr)
Out: 445 µs ± 5.49 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
import numpy as np
import numba as nb

@nb.jit(nb.float64[:](nb.float64[:]))
def f(arr):
    res = np.zeros(len(arr))
    for i in range(len(arr)):
        res[i] = (arr[i])**2
    return res

arr = np.random.rand(10000)
%timeit f(arr)
Out: 9.59 µs ± 98.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In this example, Numba is almost 50 times faster than Cython.
Being a Cython beginner, I guess I am missing something.
Of course, in this simple case, using NumPy's vectorized square function would have been far more suitable:
%timeit np.square(arr)
Out:5.75 µs ± 78.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Both Cython and Numba speed up Python code, even for a small number of operations; the more operations there are, the greater the speed-up. However, the performance gain from Cython saturates at around 100-150 times that of pure Python, whereas the speed-up from Numba keeps increasing with the number of operations.
There is also an overhead to setting up tasks in parallel, so only for a larger array does the same implementation become the fastest. For an array of size 1000*1000, Numba (nopython) takes more time than the pure NumPy implementation.
Cython is easier to distribute than Numba, which makes it a better option for user-facing libraries. It is the preferred option for most of the scientific Python stack, including NumPy, SciPy, pandas and scikit-learn. In contrast, relatively few libraries use Numba.
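For illustration, a minimal sketch of how a Cython extension is typically packaged with setuptools so that it ships as a compiled C extension (the package and file names here are hypothetical):

from setuptools import setup
from Cython.Build import cythonize

setup(
    name="fastloops",                              # hypothetical package name
    ext_modules=cythonize(
        "fastloops/_square.pyx",                   # hypothetical Cython source file
        compiler_directives={"language_level": "3"},
    ),
)

A Numba-based library instead ships plain Python source and relies on the LLVM-backed numba package being installed at runtime.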
As @Antonio has pointed out, using pow for a simple multiplication is not very wise and leads to quite an overhead. Thus, replacing pow(arr[i], 2) with arr[i]*arr[i] leads to a pretty large speed-up (a sketch of this variant is shown after the timings):
cython-pow-version 356 µs
numba-version 11 µs
cython-mult-version 14 µs
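For reference, a sketch of the multiplication variant (the function name f_mult is mine; the rest mirrors the code from the question):

%%cython
import numpy as np
cimport numpy as np

cpdef f_mult(double[:] arr):
    cdef np.ndarray[dtype=np.double_t, ndim=1] res
    res = np.zeros(len(arr), dtype=np.double)
    cdef double[:] res_view = res
    cdef int i
    for i in range(len(arr)):
        res_view[i] = arr[i] * arr[i]   # plain multiplication instead of pow(arr[i], 2)
    return res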
The remaining difference is probably due to differences between the compilers and optimization levels (LLVM vs. MSVC in my case). You might want to use clang to match Numba's performance (see for example this SO answer).
In order to make the optimization easier for the compiler, you should declare the input as a contiguous array, i.e. double[::1] arr (see this question on why it is important for vectorization), use @cython.boundscheck(False) (use the option -a to see that there is less yellow), and also add compiler flags (i.e. -O3, -march=native or similar, depending on your compiler, to enable vectorization; watch out for build flags used by default which can inhibit some optimizations, for example -fwrapv). In the end you might want to write the workhorse loop in C, compile it with the right combination of flags/compiler, and use Cython to wrap it.
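Putting these points together, a sketch of a more optimized Cython version might look like the following (the function name f_opt is mine; the compile flags are GCC/Clang style and are passed here via the %%cython magic's --compile-args option, so they may need adjusting for your compiler):

%%cython --compile-args=-O3 --compile-args=-march=native
import numpy as np
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
cpdef f_opt(double[::1] arr):          # [::1] declares a C-contiguous memoryview
    cdef Py_ssize_t n = arr.shape[0]
    cdef Py_ssize_t i
    res = np.zeros(n, dtype=np.double)
    cdef double[::1] res_view = res
    for i in range(n):
        res_view[i] = arr[i] * arr[i]  # multiplication instead of pow
    return res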
By the way, by typing the function's parameters as nb.float64[:](nb.float64[:]) you decrease Numba's performance - it is no longer allowed to assume that the input array is contiguous, thus ruling out vectorization. Let Numba detect the types (or declare the arrays as contiguous, i.e. nb.float64[::1](nb.float64[::1])), and you will get better performance:
@nb.jit(nopython=True)
def nb_vec_f(arr):
    res = np.zeros(len(arr))
    for i in range(len(arr)):
        res[i] = (arr[i])**2
    return res
Leads to the following improvement:
%timeit f(arr) # numba version
# 11.4 µs ± 137 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit nb_vec_f(arr)
# 7.03 µs ± 48.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
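Alternatively, a sketch of the explicitly contiguous signature mentioned above (the function name nb_vec_f_sig is mine):

@nb.jit(nb.float64[::1](nb.float64[::1]), nopython=True)
def nb_vec_f_sig(arr):
    # the [::1] signature tells Numba the input and output are C-contiguous
    res = np.zeros(len(arr))
    for i in range(len(arr)):
        res[i] = (arr[i])**2
    return res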
And as pointed out by @max9111, we don't have to initialize the resulting array with zeros, but can use np.empty(...) instead of np.zeros(...) - this version even beats NumPy's np.square().
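A sketch of that variant (the function name nb_vec_f_empty is mine):

@nb.jit(nopython=True)
def nb_vec_f_empty(arr):
    # np.empty skips the zero-initialization; every element is overwritten below
    res = np.empty(len(arr))
    for i in range(len(arr)):
        res[i] = (arr[i])**2
    return res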
The performance of the different approaches on my machine:
numba+vectorization+empty 3µs
np.square 4µs
numba+vectorization 7µs
numba missed vectorization 11µs
cython+mult 14µs
cython+pow 356µs