I'm trying to speed up a piece of code that convolves a 1D array (a filter) over each column of a 2D array. Somehow, when I run it with Numba's njit, I get a 7x slowdown (tested on Windows 10, Python 3.9.4 from conda, numpy 1.12.2, numba 0.53.1). Can anyone tell me why this code is slow?
import numpy as np
from numba import njit

def f1(a1, filt):
    l2 = filt.size // 2
    res = np.empty(a1.shape)
    for i in range(a1.shape[1]):
        res[:, i] = np.convolve(a1[:, i], filt)[l2:-l2]
    return res

@njit
def f1_jit(a1, filt):
    l2 = filt.size // 2
    res = np.empty(a1.shape)
    for i in range(a1.shape[1]):
        res[:, i] = np.convolve(a1[:, i], filt)[l2:-l2]
    return res
a1 = np.random.random((6400, 1000))
filt = np.random.random(65)
f1(a1, filt)
f1_jit(a1, filt)
%timeit f1(a1, filt) # 404 ms ± 19.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit f1_jit(a1, filt) # 2.8 s ± 66.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Comment: The time taken by the Numba version here is similar to yours: about 20 ms, although the timing is pretty noisy (15-25 ms across a couple of timeit runs), while the NumPy version is notably faster (~4 ms). Is there any device/hardware/library information you think might be helpful?
The problem comes from the Numba implementation of np.convolve. This is a known issue. It turns out that the current Numba implementation is much slower than the one of Numpy (versions <= 0.54.1 tested, on Windows).

On one hand, the Numpy implementation calls correlate, which itself performs a dot product that should be implemented by the fast BLAS library available on your system. On the other hand, the Numba implementation calls _get_inner_prod, which uses np.dot, which should also use the same BLAS library (assuming a BLAS is detected, which should be the case)...
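You can check which path your Numba installation will take by inspecting the flag directly (a minimal sketch; _HAVE_BLAS is a private detail of numba/np/arraymath.py and may move or disappear in other Numba versions):

import numba.np.arraymath as arraymath

# True when Numba found BLAS bindings (through SciPy) to back np.dot,
# and hence the inner products performed by np.convolve.
print(arraymath._HAVE_BLAS)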
That being said, there are multiple issues related to the dot product:

First of all, if the internal variable _HAVE_BLAS of numba/np/arraymath.py is manually disabled, Numba uses a fallback implementation of the dot product that is supposed to be significantly slower. However, it turns out that using the fallback dot product implementation in np.convolve results in a 5 times faster execution than with the BLAS wrapper on my machine! Additionally using the fastmath=True parameter in the njit Numba decorator results in an overall 8.7 times faster execution! Here is the testing code:
import numpy as np
import numba as nb

def npConvolve(a, b):
    return np.convolve(a, b)

@nb.njit('float64[:](float64[:], float64[:])')
def nbConvolveUncont(a, b):
    return np.convolve(a, b)

@nb.njit('float64[::1](float64[::1], float64[::1])')
def nbConvolveCont(a, b):
    return np.convolve(a, b)

a = np.random.random(6400)
b = np.random.random(65)

%timeit -n 100 npConvolve(a, b)
%timeit -n 100 nbConvolveUncont(a, b)
%timeit -n 100 nbConvolveCont(a, b)
Here are the raw results, which are quite interesting:
With _HAVE_BLAS=True (default):
126 µs ± 292 ns per loop
1.6 ms ± 21.3 µs per loop
1.6 ms ± 18.5 µs per loop
With _HAVE_BLAS=False:
125 µs ± 359 ns per loop
311 µs ± 1.18 µs per loop
268 µs ± 4.26 µs per loop
With _HAVE_BLAS=False and fastmath=True:
125 µs ± 757 ns per loop
327 µs ± 3.69 µs per loop
183 µs ± 654 ns per loop
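For reference, here is one way to reproduce the _HAVE_BLAS=False rows without editing numba/np/arraymath.py by hand (a sketch that assumes the flag is read when the jitted functions get compiled, so it must run before the @nb.njit definitions above; being an internal, it may change between Numba versions):

import numba.np.arraymath as arraymath

# Force the fallback (non-BLAS) dot product implementation.
arraymath._HAVE_BLAS = False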
Moreover, np_convolve of Numba internally flips some array parameter and then performs the dot product using a flipped array that has a non-trivial stride (i.e., not 1). Such a non-trivial stride may have an impact on the dot product performance. More generally, any transformation preventing the compiler from knowing that arrays are contiguous will certainly strongly impact performance. Indeed, the following test shows the impact of working on a contiguous array with the dot product implementation of Numba:
import numpy as np
import numba as nb

def np_dot(a, b):
    return np.dot(a, b)

@nb.njit('float64(float64[::1], float64[::1])')
def nb_dot_cont(a, b):
    return np.dot(a, b)

@nb.njit('float64(float64[::1], float64[:])')
def nb_dot_stride(a, b):
    return np.dot(a, b)

v = np.random.random(128*1024)

%timeit -n 200 np_dot(v, v)        # 36.5 µs ± 4.9 µs per loop
%timeit -n 200 nb_dot_stride(v, v) # 361.0 µs ± 17.1 µs per loop (x10 !!!)
%timeit -n 200 nb_dot_cont(v, v)   # 34.1 µs ± 2.9 µs per loop
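When an argument cannot be typed as contiguous, one possible mitigation (my suggestion, not something the Numba implementation does itself) is to copy the strided operand into a contiguous buffer first; whether this pays off depends on whether the dot product dominates the copy cost:

@nb.njit('float64(float64[::1], float64[:])')
def nb_dot_copy(a, b):
    # The copy makes b C-contiguous so the fast contiguous path is taken,
    # at the price of extra memory bandwidth for the copy itself.
    return np.dot(a, np.ascontiguousarray(b))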
Note that Numba can hardly accelerate the Numpy calls when they work on pretty big arrays, since Numba re-implements the Numpy functions mostly in Python and uses a JIT compiler (llvmlite) to speed them up, while Numpy is mostly implemented in plain C (with rather slow Python wrapping code). The Numpy code uses low-level processor features like SIMD instructions to greatly speed up the execution of many functions. Both appear to use BLAS libraries known to be highly optimized. Numpy tends to be generally more optimized since it is currently more mature than Numba: Numpy has many more contributors and has been developed for a longer time.
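As a practical workaround for the original column-wise convolution (my own sketch, not part of the Numba library), you can bypass np.convolve entirely and write the trimmed convolution as explicit loops that Numba compiles and parallelizes well. This assumes an odd-length filter, as in the question, so that the [l2:-l2] slice of the full convolution has exactly a1.shape[0] elements:

import numpy as np
import numba as nb

@nb.njit('float64[:, ::1](float64[:, ::1], float64[::1])',
         parallel=True, fastmath=True)
def convolve_cols(a1, filt):
    # Equivalent to np.convolve(a1[:, j], filt)[l2:-l2] for every column j.
    n, m = a1.shape
    l2 = filt.size // 2
    res = np.empty((n, m))
    for j in nb.prange(m):  # one column per thread
        for i in range(n):
            # Clip the filter window at the array boundaries.
            lo = max(0, i - l2)
            hi = min(n - 1, i + l2)
            s = 0.0
            for p in range(lo, hi + 1):
                s += a1[p, j] * filt[i - p + l2]
            res[i, j] = s
    return res

This avoids both the slow internal dot-product path and the temporary full-mode output that the [l2:-l2] slicing discards; the exact speed-up is machine-dependent.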