I have defined the following recursive array generator and am using Numba jit to try to accelerate the processing (based on this SO answer):
@jit("float32[:](float32,float32,intp)", nopython=False, nogil=True)
def calc_func(a, b, n):
res = np.empty(n, dtype="float32")
res[0] = 0
for i in range(1, n):
res[i] = a * res[i - 1] + (1 - a) * (b ** (i - 1))
return res
a = calc_func(0.988, 0.9988, 5000)
I am getting a bunch of warnings/errors that I do not quite understand. I would appreciate help explaining them and making them disappear, in order to (I'm assuming) speed up the calculation even more.
Here they are below:
NumbaWarning: Compilation is falling back to object mode WITH looplifting enabled because Function "calc_func" failed type inference due to: Invalid use of Function(<built-in function empty>) with argument(s) of type(s): (int64, dtype=Literal[str](float32))
 * parameterized
In definition 0:
    All templates rejected with literals.
In definition 1:
    All templates rejected without literals.
This error is usually caused by passing an argument of a type that is unsupported by the named function.
[1] During: resolving callee type: Function(<built-in function empty>)
[2] During: typing of call at thenameofmyscript.py (71)

File "thenameofmyscript.py", line 71:
def calc_func(a, b, n):
    res = np.empty(n, dtype="float32")
    ^

@jit("float32[:](float32,float32,intp)", nopython=False, nogil=True)
thenameofmyscript.py:69: NumbaWarning: Compilation is falling back to object mode WITHOUT looplifting enabled because Function "calc_func" failed type inference due to: cannot determine Numba type of
<class 'numba.dispatcher.LiftedLoop'>
File "thenameofmyscript.py", line 73:
def calc_func(a, b, n):
<source elided>
res[0] = 0
for i in range(1, n):
^
@jit("float32:", nopython=False, nogil=True)
H:\projects\decay-optimizer\venv\lib\site-packages\numba\compiler.py:742: NumbaWarning: Function "calc_func" was compiled in object mode without forceobj=True, but has lifted loops.
File "thenameofmyscript.py", line 70:
@jit("float32[:](float32,float32,intp)", nopython=False, nogil=True)
def calc_func(a, b, n):
^
self.func_ir.loc))
H:\projects\decay-optimizer\venv\lib\site-packages\numba\compiler.py:751: NumbaDeprecationWarning: Fall-back from the nopython compilation path to the object mode compilation path has been detected, this is deprecated behaviour.
File "thenameofmyscript.py", line 70:
@jit("float32[:](float32,float32,intp)", nopython=False, nogil=True)
def calc_func(a, b, n):
^
warnings.warn(errors.NumbaDeprecationWarning(msg, self.func_ir.loc))
thenameofmyscript.py:69: NumbaWarning: Code running in object mode won't allow parallel execution despite nogil=True.
@jit("float32[:](float32,float32,intp)", nopython=False, nogil=True)
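The root cause is the first warning: the Numba version producing these messages cannot type a dtype given as a string in nopython mode, so np.empty(n, dtype="float32") fails type inference, the whole function falls back to object mode, and the remaining warnings are consequences of that fallback. Below is a minimal sketch of a fix (calc_func_fixed is only an illustrative name): pass np.float32 instead of the string, and use njit so a typing failure raises an error instead of silently falling back; the explicit signature is dropped here because, as discussed further down, it is not needed in normal JIT mode.

import numpy as np
from numba import njit

@njit(nogil=True)
def calc_func_fixed(a, b, n):
    #np.float32 instead of the string "float32", so nopython type inference succeeds
    res = np.empty(n, dtype=np.float32)
    res[0] = 0
    for i in range(1, n):
        res[i] = a * res[i - 1] + (1 - a) * (b ** (i - 1))
    return res

a = calc_func_fixed(0.988, 0.9988, 5000)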
Modern CPUs are quite fast at additions, subtractions and multiplications. Operations like exponentiation should be avoided where possible.
Example
In this example I replaced the costly exponentiation with a simple multiplication. Simplifications like that can lead to quite high speedups, but they may also change the result slightly.
First, your implementation, but in float64 and without any explicit signature (signatures are treated later in another simple example).
import numpy as np
import numba as nb

#@nb.njit() is a shortcut for @nb.jit(nopython=True)
@nb.njit()
def calc_func_opt_1(a, b, n):
    res = np.empty(n, dtype=np.float64)
    fact = b
    res[0] = 0.
    res[1] = a * res[0] + (1. - a) * 1.
    res[2] = a * res[1] + (1. - a) * fact
    for i in range(3, n):
        fact *= b
        res[i] = a * res[i - 1] + (1. - a) * fact
    return res
It is also a good idea to use scalars where possible, instead of reading values back from the array inside the loop.
@nb.njit()
def calc_func_opt_2(a, b, n):
    res = np.empty(n, dtype=np.float64)
    fact_1 = b
    fact_2 = 0.
    res[0] = fact_2
    fact_2 = a * fact_2 + (1. - a) * 1.
    res[1] = fact_2
    fact_2 = a * fact_2 + (1. - a) * fact_1
    res[2] = fact_2
    for i in range(3, n):
        fact_1 *= b
        fact_2 = a * fact_2 + (1. - a) * fact_1
        res[i] = fact_2
    return res
Timings
%timeit a = calc_func(0.988, 0.9988, 5000)
222 µs ± 2.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit a = calc_func_opt_1(0.988, 0.9988, 5000)
22.7 µs ± 45.5 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit a = calc_func_opt_2(0.988, 0.9988, 5000)
15.3 µs ± 35.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
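Since replacing the exponentiation with a running product may change the result slightly, a quick sanity check against a plain float64 reference implementation can confirm that the optimized versions agree within floating-point tolerance (calc_func_ref below is just an illustrative helper, not part of the original code):

def calc_func_ref(a, b, n):
    #plain Python/NumPy reference in float64, same recurrence as in the question
    res = np.empty(n, dtype=np.float64)
    res[0] = 0
    for i in range(1, n):
        res[i] = a * res[i - 1] + (1 - a) * (b ** (i - 1))
    return res

ref = calc_func_ref(0.988, 0.9988, 5000)
print(np.allclose(ref, calc_func_opt_1(0.988, 0.9988, 5000)))  #expected True
print(np.allclose(ref, calc_func_opt_2(0.988, 0.9988, 5000)))  #expected True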
Signatures are necessary in ahead-of-time (AOT) compilation mode, but not in the usual JIT mode. The example above is not SIMD-vectorizable, so you won't see much positive or negative effect from a possibly suboptimal declaration of inputs and outputs. Let's look at another example.
#Numba is able to SIMD-vectorize this loop if
#a, b and res are contiguous arrays
@nb.njit(fastmath=True)
def some_function_1(a, b):
    res = np.empty_like(a)
    for i in range(a.shape[0]):
        res[i] = a[i]**2 + b[i]**2
    return res

@nb.njit("float64[:](float64[:],float64[:])", fastmath=True)
def some_function_2(a, b):
    res = np.empty_like(a)
    for i in range(a.shape[0]):
        res[i] = a[i]**2 + b[i]**2
    return res
a=np.random.rand(10_000)
b=np.random.rand(10_000)
#Example for non-contiguous input
#a=np.random.rand(10_000)[0::2]
#b=np.random.rand(10_000)[0::2]
%timeit res=some_function_1(a,b)
5.59 µs ± 36.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit res=some_function_2(a,b)
9.36 µs ± 47.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Why is the version with signatures slower?
Let's have a closer look at the signatures.
some_function_1.nopython_signatures
#[(array(float64, 1d, C), array(float64, 1d, C)) -> array(float64, 1d, C)]
#this is equivalent to
#"float64[::1](float64[::1],float64[::1])"
some_function_2.nopython_signatures
#[(array(float64, 1d, A), array(float64, 1d, A)) -> array(float64, 1d, A)]
#this is equivalent to the declared
#"float64[:](float64[:],float64[:])"
If the memory layout is unknown at compile time, it is often impossible to SIMD-vectorize the algorithm. Of course you can explicitly declare C-contiguous arrays, but then the function won't work anymore for non-contiguous inputs, which is normally not intended.
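For illustration, a minimal sketch of that trade-off (some_function_3 is a hypothetical variant, not from the code above): with an explicitly C-contiguous signature ("::1") the loop stays SIMD-friendly, but calling the function with a strided, non-contiguous view raises a TypeError because no matching definition exists.

@nb.njit("float64[::1](float64[::1],float64[::1])", fastmath=True)
def some_function_3(a, b):
    res = np.empty_like(a)
    for i in range(a.shape[0]):
        res[i] = a[i]**2 + b[i]**2
    return res

res = some_function_3(a, b)            #works: a and b are C-contiguous
#some_function_3(a[0::2], b[0::2])     #TypeError: no matching definition for non-contiguous input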