When iterating over NumPy arrays, Numba seems dramatically faster than Cython.
What Cython optimizations am I possibly missing?
Here is a simple example:
import numpy as np

def f(arr):
    res = np.zeros(len(arr))
    for i in range(len(arr)):
        res[i] = (arr[i])**2
    return res

arr = np.random.rand(10000)
%timeit f(arr)
Out: 4.81 ms ± 72.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%load_ext cython

%%cython
import numpy as np
cimport numpy as np
cimport cython
from libc.math cimport pow

#@cython.boundscheck(False)
#@cython.wraparound(False)
cpdef f(double[:] arr):
    cdef np.ndarray[dtype=np.double_t, ndim=1] res
    res = np.zeros(len(arr), dtype=np.double)
    cdef double[:] res_view = res
    cdef int i
    for i in range(len(arr)):
        res_view[i] = pow(arr[i], 2)
    return res

arr = np.random.rand(10000)
%timeit f(arr)
Out: 445 µs ± 5.49 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
import numpy as np
import numba as nb

@nb.jit(nb.float64[:](nb.float64[:]))
def f(arr):
    res = np.zeros(len(arr))
    for i in range(len(arr)):
        res[i] = (arr[i])**2
    return res

arr = np.random.rand(10000)
%timeit f(arr)
Out: 9.59 µs ± 98.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In this example, Numba is almost 50 times faster than Cython.
Being a Cython beginner, I guess I am missing something.
Of course, in this simple case, using NumPy's vectorized square function would have been far more suitable:
%timeit np.square(arr)
Out:5.75 µs ± 78.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Both Cython and Numba speed up Python code, even for a small number of operations; the more operations there are, the greater the speed-up. However, the performance gain from Cython saturates at around 100-150 times that of pure Python, whereas the speed-up from Numba keeps increasing with the number of operations.
There is also an overhead to setting up tasks in parallel, so only for a larger array does the same implementation become the fastest. For an array of size 1000*1000, Numba (nopython) takes more time than the pure NumPy implementation.
Cython is easier to distribute than Numba, which makes it a better option for user-facing libraries. It is the preferred option for most of the scientific Python stack, including NumPy, SciPy, pandas and scikit-learn. In contrast, relatively few libraries use Numba.
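For illustration, a minimal sketch of how a Cython extension is typically packaged with setuptools so that it ships as a compiled C extension (the package and file names here are hypothetical):

from setuptools import setup
from Cython.Build import cythonize

setup(
    name="fastloops",                              # hypothetical package name
    ext_modules=cythonize(
        "fastloops/_square.pyx",                   # hypothetical Cython source file
        compiler_directives={"language_level": "3"},
    ),
)

A Numba-based library instead ships plain Python source and relies on the LLVM-backed numba package being installed at runtime.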
As @Antonio has pointed out, using pow for a simple multiplication is not very wise and leads to quite an overhead. Thus, replacing pow(arr[i], 2) with arr[i]*arr[i] leads to a pretty large speed-up (a sketch of this variant is shown after the timings):
cython-pow-version 356 µs
numba-version 11 µs
cython-mult-version 14 µs
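For reference, a sketch of the multiplication variant (the function name f_mult is mine; the rest mirrors the code from the question):

%%cython
import numpy as np
cimport numpy as np

cpdef f_mult(double[:] arr):
    cdef np.ndarray[dtype=np.double_t, ndim=1] res
    res = np.zeros(len(arr), dtype=np.double)
    cdef double[:] res_view = res
    cdef int i
    for i in range(len(arr)):
        res_view[i] = arr[i] * arr[i]   # plain multiplication instead of pow(arr[i], 2)
    return res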
The remaining difference is probably due to differences between the compilers and optimization levels (LLVM vs. MSVC in my case). You might want to use clang to match Numba's performance (see for example this SO answer).
In order to make the optimization easier for the compiler, you should declare the input as a contiguous array, i.e. double[::1] arr (see this question on why it is important for vectorization), use @cython.boundscheck(False) (use the option -a to see that there is less yellow), and also add compiler flags (i.e. -O3, -march=native or similar, depending on your compiler, to enable vectorization; watch out for build flags used by default which can inhibit some optimizations, for example -fwrapv). In the end you might want to write the workhorse loop in C, compile it with the right combination of flags/compiler, and use Cython to wrap it.
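Putting these points together, a sketch of a more optimized Cython version might look like the following (the function name f_opt is mine; the compile flags are GCC/Clang style and are passed here via the %%cython magic's --compile-args option, so they may need adjusting for your compiler):

%%cython --compile-args=-O3 --compile-args=-march=native
import numpy as np
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
cpdef f_opt(double[::1] arr):          # [::1] declares a C-contiguous memoryview
    cdef Py_ssize_t n = arr.shape[0]
    cdef Py_ssize_t i
    res = np.zeros(n, dtype=np.double)
    cdef double[::1] res_view = res
    for i in range(n):
        res_view[i] = arr[i] * arr[i]  # multiplication instead of pow
    return res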
By the way, by typing the function's parameters as nb.float64[:](nb.float64[:]) you decrease Numba's performance - it is no longer allowed to assume that the input array is contiguous, thus ruling out vectorization. Let Numba detect the types (or declare the arrays as contiguous, i.e. nb.float64[::1](nb.float64[::1])), and you will get better performance:
@nb.jit(nopython=True)
def nb_vec_f(arr):
    res = np.zeros(len(arr))
    for i in range(len(arr)):
        res[i] = (arr[i])**2
    return res
Leads to the following improvement:
%timeit f(arr) # numba version
# 11.4 µs ± 137 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit nb_vec_f(arr)
# 7.03 µs ± 48.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
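Alternatively, a sketch of the explicitly contiguous signature mentioned above (the function name nb_vec_f_sig is mine):

@nb.jit(nb.float64[::1](nb.float64[::1]), nopython=True)
def nb_vec_f_sig(arr):
    # the [::1] signature tells Numba the input and output are C-contiguous
    res = np.zeros(len(arr))
    for i in range(len(arr)):
        res[i] = (arr[i])**2
    return res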
And as pointed out by @max9111, we don't have to initialize the resulting array with zeros, but can use np.empty(...) instead of np.zeros(...) - this version even beats NumPy's np.square().
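A sketch of that variant (the function name nb_vec_f_empty is mine):

@nb.jit(nopython=True)
def nb_vec_f_empty(arr):
    # np.empty skips the zero-initialization; every element is overwritten below
    res = np.empty(len(arr))
    for i in range(len(arr)):
        res[i] = (arr[i])**2
    return res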
The performance of the different approaches on my machine:
numba+vectorization+empty 3µs
np.square 4µs
numba+vectorization 7µs
numba missed vectorization 11µs
cython+mult 14µs
cython+pow 356µs