I'm trying to speed up a piece of code that convolves a 1D array (a filter) over each column of a 2D array. Somehow, when I run it with Numba's njit, I get a 7x slowdown (tested on Windows 10, Python 3.9.4 from conda, numpy 1.12.2, numba 0.53.1). Can anyone tell me why this code is slow?
import numpy as np
from numba import njit

def f1(a1, filt):
    l2 = filt.size // 2
    res = np.empty(a1.shape)
    for i in range(a1.shape[1]):
        res[:, i] = np.convolve(a1[:, i], filt)[l2:-l2]
    return res

@njit
def f1_jit(a1, filt):
    l2 = filt.size // 2
    res = np.empty(a1.shape)
    for i in range(a1.shape[1]):
        res[:, i] = np.convolve(a1[:, i], filt)[l2:-l2]
    return res
a1 = np.random.random((6400, 1000))
filt = np.random.random(65)
f1(a1, filt)
f1_jit(a1, filt)
%timeit f1(a1, filt) # 404 ms ± 19.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit f1_jit(a1, filt) # 2.8 s ± 66.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Comment: The time taken by the Numba version here is similar to yours: about 20 ms, although the timing is pretty noisy (15-25 ms across a couple of timeit runs), while the NumPy version is notably faster (~4 ms). Is there any device/hardware/library information you think might be helpful?
The problem comes from the Numba implementation of np.convolve. This is a known issue. It turns out that the current Numba implementation is much slower than the one of Numpy (versions <= 0.54.1 tested, on Windows).

On one hand, the Numpy implementation calls correlate, which itself performs a dot product that should be implemented by the fast BLAS library available on your system. On the other hand, the Numba implementation calls _get_inner_prod, which uses np.dot, which should also use the same BLAS library (assuming a BLAS is detected, which should be the case)...
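You can check which path your Numba installation will take by inspecting the flag directly (a minimal sketch; _HAVE_BLAS is a private detail of numba/np/arraymath.py and may move or disappear in other Numba versions):

import numba.np.arraymath as arraymath

# True when Numba found BLAS bindings (through SciPy) to back np.dot,
# and hence the inner products performed by np.convolve.
print(arraymath._HAVE_BLAS)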
That being said, there are multiple issues related to the dot product:

First of all, if the internal variable _HAVE_BLAS of numba/np/arraymath.py is manually disabled, Numba uses a fallback implementation of the dot product that is supposed to be significantly slower. However, it turns out that using the fallback dot product implementation in np.convolve results in a 5 times faster execution than with the BLAS wrapper on my machine! Additionally using the fastmath=True parameter in the njit Numba decorator results in an overall 8.7 times faster execution! Here is the testing code:
import numpy as np
import numba as nb

def npConvolve(a, b):
    return np.convolve(a, b)

@nb.njit('float64[:](float64[:], float64[:])')
def nbConvolveUncont(a, b):
    return np.convolve(a, b)

@nb.njit('float64[::1](float64[::1], float64[::1])')
def nbConvolveCont(a, b):
    return np.convolve(a, b)

a = np.random.random(6400)
b = np.random.random(65)

%timeit -n 100 npConvolve(a, b)
%timeit -n 100 nbConvolveUncont(a, b)
%timeit -n 100 nbConvolveCont(a, b)
Here are the raw results, which are quite interesting:
With _HAVE_BLAS=True (default):
126 µs ± 292 ns per loop
1.6 ms ± 21.3 µs per loop
1.6 ms ± 18.5 µs per loop
With _HAVE_BLAS=False:
125 µs ± 359 ns per loop
311 µs ± 1.18 µs per loop
268 µs ± 4.26 µs per loop
With _HAVE_BLAS=False and fastmath=True:
125 µs ± 757 ns per loop
327 µs ± 3.69 µs per loop
183 µs ± 654 ns per loop
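For reference, here is one way to reproduce the _HAVE_BLAS=False rows without editing numba/np/arraymath.py by hand (a sketch that assumes the flag is read when the jitted functions get compiled, so it must run before the @nb.njit definitions above; being an internal, it may change between Numba versions):

import numba.np.arraymath as arraymath

# Force the fallback (non-BLAS) dot product implementation.
arraymath._HAVE_BLAS = False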
Moreover, np_convolve of Numba internally flips some array parameter and then performs the dot product using a flipped array that has a non-trivial stride (i.e., not 1). Such a non-trivial stride may have an impact on the dot product performance. More generally, any transformation preventing the compiler from knowing that arrays are contiguous will certainly strongly impact performance. Indeed, the following test shows the impact of working on a contiguous array with the dot product implementation of Numba:
import numpy as np
import numba as nb

def np_dot(a, b):
    return np.dot(a, b)

@nb.njit('float64(float64[::1], float64[::1])')
def nb_dot_cont(a, b):
    return np.dot(a, b)

@nb.njit('float64(float64[::1], float64[:])')
def nb_dot_stride(a, b):
    return np.dot(a, b)

v = np.random.random(128*1024)

%timeit -n 200 np_dot(v, v)        # 36.5 µs ± 4.9 µs per loop
%timeit -n 200 nb_dot_stride(v, v) # 361.0 µs ± 17.1 µs per loop (x10 !!!)
%timeit -n 200 nb_dot_cont(v, v)   # 34.1 µs ± 2.9 µs per loop
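When an argument cannot be typed as contiguous, one possible mitigation (my suggestion, not something the Numba implementation does itself) is to copy the strided operand into a contiguous buffer first; whether this pays off depends on whether the dot product dominates the copy cost:

@nb.njit('float64(float64[::1], float64[:])')
def nb_dot_copy(a, b):
    # The copy makes b C-contiguous so the fast contiguous path is taken,
    # at the price of extra memory bandwidth for the copy itself.
    return np.dot(a, np.ascontiguousarray(b))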
Note that Numba can hardly accelerate the Numpy calls when they work on pretty big arrays, since Numba re-implements the Numpy functions mostly in Python and uses a JIT compiler (llvmlite) to speed them up, while Numpy is mostly implemented in plain C (with rather slow Python wrapping code). The Numpy code uses low-level processor features like SIMD instructions to greatly speed up the execution of many functions. Both appear to use BLAS libraries known to be highly optimized. Numpy tends to be generally more optimized since it is currently more mature than Numba: Numpy has many more contributors and has been developed for a longer time.
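As a practical workaround for the original column-wise convolution (my own sketch, not part of the Numba library), you can bypass np.convolve entirely and write the trimmed convolution as explicit loops that Numba compiles and parallelizes well. This assumes an odd-length filter, as in the question, so that the [l2:-l2] slice of the full convolution has exactly a1.shape[0] elements:

import numpy as np
import numba as nb

@nb.njit('float64[:, ::1](float64[:, ::1], float64[::1])',
         parallel=True, fastmath=True)
def convolve_cols(a1, filt):
    # Equivalent to np.convolve(a1[:, j], filt)[l2:-l2] for every column j.
    n, m = a1.shape
    l2 = filt.size // 2
    res = np.empty((n, m))
    for j in nb.prange(m):  # one column per thread
        for i in range(n):
            # Clip the filter window at the array boundaries.
            lo = max(0, i - l2)
            hi = min(n - 1, i + l2)
            s = 0.0
            for p in range(lo, hi + 1):
                s += a1[p, j] * filt[i - p + l2]
            res[i, j] = s
    return res

This avoids both the slow internal dot-product path and the temporary full-mode output that the [l2:-l2] slicing discards; the exact speed-up is machine-dependent.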