Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Numba np.convolve really slow

I'm trying to speed up a piece of code convolving a 1D array (filter) over each column of a 2D array. Somehow, when I run it with numba's njit, I get a 7x slow down. My thoughts:

  • Maybe column indexing is slowing it down, but switching to row indexing didn't affect performance
  • Maybe slice indexing the results of the convolution is slow, but removing it didn't change anything
  • I've checked that numba understands all the types properly

(tested on Windows 10, python 3.9.4 from conda, numpy 1.12.2, numba 0.53.1)

Can anyone tell me why this code is slow?

import numpy as np
from numba import njit

def f1(a1, filt):
    l2 = filt.size // 2
    res = np.empty(a1.shape)
    for i in range(a1.shape[1]):
        res[:, i] = np.convolve(a1[:, i], filt)[l2:-l2]
    return res

@njit
def f1_jit(a1, filt):
    l2 = filt.size // 2
    res = np.empty(a1.shape)
    for i in range(a1.shape[1]):
        res[:, i] = np.convolve(a1[:, i], filt)[l2:-l2]
    return res

a1 = np.random.random((6400, 1000))
filt = np.random.random((65))
f1(a1, filt)
f1_jit(a1, filt)

%timeit f1(a1, filt)     # 404 ms ± 19.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit f1_jit(a1, filt) # 2.8 s ± 66.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
like image 426
kwsp Avatar asked Dec 10 '21 23:12

kwsp


People also ask

Does Numba speed up NumPy?

Numba can speed things up Of course, it turns out that NumPy has a function that will do this already, numpy. maximum. accumulate . Using that, running only takes 0.03 seconds.

What is convolve method NP convolve for?

Returns the discrete, linear convolution of two one-dimensional sequences. The convolution operator is often seen in signal processing, where it models the effect of a linear time-invariant system on a signal [1].

Can I use Numba with NumPy?

Numba is a just-in-time compiler for Python that works best on code that uses NumPy arrays and functions, and loops.

Does Numba support matrix multiplication?

Matrix Multiplications Matrix multiplication is another example that shows how Numba could be useful to boost up the processing time. Even without Cuda, we could achieve better performance.

Is Numba faster than NumPy?

As shown, after the first call, the Numba version of the function is faster than the Numpy version. In the same time, if we call again the Numpy version, it take a similar run time. This demonstrates well the effect of compiling in Numba.

What is the ratio of NumPy and Numba run time?

In fact, the ratio of the Numpy and Numba run time will depends on both datasize, and the number of loops, or more general the nature of the function (to be compiled). Below is just an example of Numpy/Numba runtime ratio over those two parameters.

What is the time taken by the Numba version?

The time taken by the numba version is similarish to your time. 20 ms although the time is pretty noisy 15-25 running timeit a couple times, but the numpy version is notably faster (4ish ms). Is there any device/hardware/library information you think might be helpful?

How to convert a python function to Numba function?

In a nutshell, a python function can be converted into Numba function simply by using the decorator "@jit". When compiling this function, Numba will look at its Bytecode to find the operators and also unbox the function’s arguments to find out the variables’ types.


1 Answers

The problem comes from the Numba implementation of np.convolve. This is a known issue. It turns out that the current Numba implementation is much slower than the one of Numpy (version <=0.54.1 tested on Windows).


Under the hood

On one hand, the Numpy implementation call correlate which itself performs a dot product that should be implemented by the fast BLAS library available on your system. On the other hand, the Numba implementation calls _get_inner_prod which use np.dot that should also use the same BLAS library (assuming a BLAS is detected which should be the case)...

That being said, there are multiple issues related to the dot product:

First of all, if the internal variable _HAVE_BLAS of numba/np/arraymath.py is manually disabled, Numba use a fallback implementation of the dot product supposed to be significantly slower. However, it turns out that using the fallback dot product implementation used by np.convolve result in a 5 times faster execution than with the BLAS wrapper on my machine! Using additionally the parameter fastmath=True in the njit Numba decorator results in an overall 8.7 times faster execution! Here is the testing code:

import numpy as np
import numba as nb

def npConvolve(a, b):
    return np.convolve(a, b)

@nb.njit('float64[:](float64[:], float64[:])')
def nbConvolveUncont(a, b):
    return np.convolve(a, b)

@nb.njit('float64[::1](float64[::1], float64[::1])')
def nbConvolveCont(a, b):
    return np.convolve(a, b)

a = np.random.random(6400)
b = np.random.random(65)
%timeit -n 100 npConvolve(a, b)
%timeit -n 100 nbConvolveUncont(a, b)
%timeit -n 100 nbConvolveCont(a, b)

Here are raw interesting results:

With _HAVE_BLAS=True (default):
126 µs ± 292 ns per loop
1.6 ms ± 21.3 µs per loop
1.6 ms ± 18.5 µs per loop

With _HAVE_BLAS=False:
125 µs ± 359 ns per loop
311 µs ± 1.18 µs per loop
268 µs ± 4.26 µs per loop

With _HAVE_BLAS=False and fastmath=True:
125 µs ± 757 ns per loop
327 µs ± 3.69 µs per loop
183 µs ± 654 ns per loop

Moreover, np_convolve of Numba internally flip some array parameter and then perform the dot product using a flipped array that have a non trivial stride (ie. not 1). Such a non-trivial stride may have an impact on the dot product performance. More generally, any transformation preventing the compiler to know that arrays are contiguous will certainly strongly impact the performances. Indeed, the following test shows the impact of working on a contiguous array with the dot product implementation of Numba:

import numpy as np
import numba as nb

def np_dot(a, b):
    return np.dot(a, b)

@nb.njit('float64(float64[::1], float64[::1])')
def nb_dot_cont(a, b):
    return np.dot(a, b)

@nb.njit('float64(float64[::1], float64[:])')
def nb_dot_stride(a, b):
    return np.dot(a, b)

v = np.random.random(128*1024)
%timeit -n 200 np_dot(v, v)         #  36.5 µs ±  4.9 µs per loop
%timeit -n 200 nb_dot_stride(v, v)  # 361.0 µs ± 17.1 µs per loop  (x10 !!!)
%timeit -n 200 nb_dot_cont(v, v)    #  34.1 µs ±  2.9 µs per loop

Some general notes about Numpy and Numba

Note that Numba can hardly accelerate the Numpy calls when they work on pretty-big arrays since Numba re-implement Numpy functions mostly in Python and use a JIT compiler (LLVM-Lite) to speed them up, while Numpy is mostly implemented in plain-C (with a rather-slow Python wrapping code). The Numpy code uses low-level processor features like SIMD instructions to speed up a lot the execution of many functions. Both appear to use BLAS libraries known to be highly optimized. Numpy tends to be generally more optimized since Numpy is currently more mature than Numba: Numpy has much more contributors working since a longer time.

like image 110
Jérôme Richard Avatar answered Nov 14 '22 23:11

Jérôme Richard