I am trying to optimize the following loop:
def numpy(nx, nz, c, rho):
    for ix in range(2, nx-3):
        for iz in range(2, nz-3):
            a[ix, iz] = sum(c*rho[ix-1:ix+3, iz])
            b[ix, iz] = sum(c*rho[ix-2:ix+2, iz])
    return a, b
I tried different solutions and found that using numba to calculate the sum of the product leads to better performance:
import numpy as np
import numba as nb
import time

@nb.autojit
def sum_opt(arr1, arr2):
    # dot product of two 1-D arrays, written as an explicit loop for numba
    s = arr1[0]*arr2[0]
    for i in range(1, len(arr1)):
        s += arr1[i]*arr2[i]
    return s

def numba1(nx, nz, c, rho):
    # plain Python double loop, only the inner sum is numba-compiled
    for ix in range(2, nx-3):
        for iz in range(2, nz-3):
            a[ix, iz] = sum_opt(c, rho[ix-1:ix+3, iz])
            b[ix, iz] = sum_opt(c, rho[ix-2:ix+2, iz])
    return a, b

@nb.autojit
def numba2(nx, nz, c, rho):
    # same double loop, but the whole function is numba-compiled
    for ix in range(2, nx-3):
        for iz in range(2, nz-3):
            a[ix, iz] = sum_opt(c, rho[ix-1:ix+3, iz])
            b[ix, iz] = sum_opt(c, rho[ix-2:ix+2, iz])
    return a, b

nx = 1024
nz = 256
rho = np.random.rand(nx, nz)
c = np.random.rand(4)
a = np.zeros((nx, nz))
b = np.zeros((nx, nz))

ti = time.clock()
a, b = numpy(nx, nz, c, rho)
print 'Time numpy : ' + str(round(time.clock() - ti, 4))

ti = time.clock()
a, b = numba1(nx, nz, c, rho)
print 'Time numba1 : ' + str(round(time.clock() - ti, 4))

ti = time.clock()
a, b = numba2(nx, nz, c, rho)
print 'Time numba2 : ' + str(round(time.clock() - ti, 4))
This leads to:
Time numpy : 4.1595
Time numba1 : 0.6993
Time numba2 : 1.0135
Using the numba version of the sum function (sum_opt) performs very well. But I am wondering why the numba version of the double-loop function (numba2) leads to slower execution times. I tried using jit instead of autojit and specifying the argument types, but it was worse.
I also noticed that looping first over the smallest range is slower than looping first over the biggest range. Is there any explanation?
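For reference, here is a minimal sketch of the swapped ordering I mean (same body as numba2, only the loop order changes; numba2_swapped is just an illustrative name):
@nb.autojit
def numba2_swapped(nx, nz, c, rho):
    for iz in range(2, nz-3):        # smallest range outermost
        for ix in range(2, nx-3):    # largest range innermost
            a[ix, iz] = sum_opt(c, rho[ix-1:ix+3, iz])
            b[ix, iz] = sum_opt(c, rho[ix-2:ix+2, iz])
    return a, b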
Either way, I am sure this double-loop function can be improved a lot by vectorizing the problem (like this) or by using another method (map?), but I am a little confused about these methods.
In the other parts of my code, I used numba and numpy slicing methods to replace all explicit loops, but in this particular case I don't know how to set it up.
Any ideas?
EDIT
Thanks for all your comments. I worked a little on this problem:
import numba as nb
import numpy as np
from scipy import signal
import time

@nb.jit(['float64(float64[:], float64[:])'], nopython=True)
def sum_opt(arr1, arr2):
    s = arr1[0]*arr2[0]
    for i in xrange(1, len(arr1)):
        s += arr1[i]*arr2[i]
    return s

@nb.autojit
def numba1(nx, nz, c, rho, a, b):
    for ix in range(2, nx-3):
        for iz in range(2, nz-3):
            a[ix, iz] = sum_opt(c, rho[ix-1:ix+3, iz])
            b[ix, iz] = sum_opt(c, rho[ix-2:ix+2, iz])
    return a, b

@nb.jit(nopython=True)
def numba2(nx, nz, c, rho, a, b):
    for ix in range(2, nx-3):
        for iz in range(2, nz-3):
            a[ix, iz] = sum_opt(c, rho[ix-1:ix+3, iz])
            b[ix, iz] = sum_opt(c, rho[ix-2:ix+2, iz])
    return a, b

@nb.jit(['float64[:,:](int16, int16, float64[:], float64[:,:], float64[:,:])'], nopython=True)
def numba3a(nx, nz, c, rho, a):
    for ix in range(2, nx-3):
        for iz in range(2, nz-3):
            a[ix, iz] = sum_opt(c, rho[ix-1:ix+3, iz])
    return a

@nb.jit(['float64[:,:](int16, int16, float64[:], float64[:,:], float64[:,:])'], nopython=True)
def numba3b(nx, nz, c, rho, b):
    for ix in range(2, nx-3):
        for iz in range(2, nz-3):
            b[ix, iz] = sum_opt(c, rho[ix-2:ix+2, iz])
    return b

def convol(nx, nz, c, aa, bb):
    # convolution-based version (rho is taken from the enclosing scope)
    s1 = rho[1:nx-1, 2:nz-3]
    s2 = rho[0:nx-2, 2:nz-3]
    kernel = c[:, None][::-1]
    aa[2:nx-3, 2:nz-3] = signal.convolve2d(s1, kernel, boundary='symm', mode='valid')
    bb[2:nx-3, 2:nz-3] = signal.convolve2d(s2, kernel, boundary='symm', mode='valid')
    return aa, bb

nx = 1024
nz = 256
rho = np.random.rand(nx, nz)
c = np.random.rand(4)
a = np.zeros((nx, nz))
b = np.zeros((nx, nz))

ti = time.clock()
for i in range(1000):
    a, b = numba1(nx, nz, c, rho, a, b)
print 'Time numba1 : ' + str(round(time.clock() - ti, 4))

ti = time.clock()
for i in range(1000):
    a, b = numba2(nx, nz, c, rho, a, b)
print 'Time numba2 : ' + str(round(time.clock() - ti, 4))

ti = time.clock()
for i in range(1000):
    a = numba3a(nx, nz, c, rho, a)
    b = numba3b(nx, nz, c, rho, b)
print 'Time numba3 : ' + str(round(time.clock() - ti, 4))

ti = time.clock()
for i in range(1000):
    a, b = convol(nx, nz, c, a, b)
print 'Time convol : ' + str(round(time.clock() - ti, 4))
Your solution is very elegant, Divakar, but I have to use this function a large number of times in my code. So, for 1000 iterations, this leads to:
Time numba1 : 3.2487
Time numba2 : 3.7012
Time numba3 : 3.2088
Time convol : 22.7696
autojit and jit are very close. However, when using jit, it seems important to specify all argument types. I do not know if there is a way to specify argument types in the jit decorator when the function has multiple outputs. Does anyone know?
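For what it's worth, here is a minimal sketch of how a tuple return type might be spelled out in a jit signature using numba.types.UniTuple; I have not verified this against the numba version used above, so treat the syntax as an assumption (numba2_sig is just an illustrative name):
import numba as nb

# Assumed syntax: the return type is a homogeneous tuple of two 2-D float64 arrays
sig = nb.types.UniTuple(nb.float64[:, :], 2)(
    nb.int64, nb.int64, nb.float64[:], nb.float64[:, :],
    nb.float64[:, :], nb.float64[:, :])

@nb.jit([sig], nopython=True)
def numba2_sig(nx, nz, c, rho, a, b):
    for ix in range(2, nx-3):
        for iz in range(2, nz-3):
            a[ix, iz] = sum_opt(c, rho[ix-1:ix+3, iz])
            b[ix, iz] = sum_opt(c, rho[ix-2:ix+2, iz])
    return a, b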
For now, I have not found any other solution than using numba. New ideas are welcome!
You are basically performing 2D convolution there, with the small modification that your kernel is not reversed, as it would be in the usual convolution operation.
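To see that connection concretely, here is a small self-contained check (the names col and ix are just for illustration, not from the original post) showing that the per-point sum in the loop equals one sample of a 1-D convolution with the reversed kernel:
import numpy as np

c = np.random.rand(4)
col = np.random.rand(10)   # stands in for one column of rho
ix = 3

loop_value = np.sum(c * col[ix-1:ix+3])                    # as in the loopy code
conv_value = np.convolve(col, c[::-1], mode='full')[ix+2]  # reversed kernel
print np.isclose(loop_value, conv_value)                   # True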
So, basically, there are two things we need to do here to use signal.convolve2d to solve our case:
1. Slice rho to select the portion of it that is used in the original loopy version of your code. This will be the input data to the convolution.
2. Reverse the kernel c and feed it, along with the sliced data, to signal.convolve2d.
Please note that these are to be done for the calculation of a and b separately.
Here's the implementation -
import numpy as np
from scipy import signal
# Slices for convolutions to get a and b respectively
s1 = rho[1:nx-1,2:nz-3]
s2 = rho[0:nx-2,2:nz-3]
kernel = c[:,None][::-1] # convolution kernel
# Setup output arrays and fill them with convolution results
a = np.zeros((nx, nz))
b = np.zeros((nx, nz))
a[2:nx-3,2:nz-3] = signal.convolve2d(s1, kernel, boundary='symm', mode='valid')
b[2:nx-3,2:nz-3] = signal.convolve2d(s2, kernel, boundary='symm', mode='valid')
If you don't need the extra zeros around the boundaries of the output arrays, you could simply use the outputs from signal.convolve2d as they are, which should further boost performance.
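As a minimal sketch of that variant (reusing s1, s2 and kernel from above; the _core names are just illustrative):
# Keep only the 'valid' convolution results, without padding them
# back into full-size zero arrays.
a_core = signal.convolve2d(s1, kernel, boundary='symm', mode='valid')
b_core = signal.convolve2d(s2, kernel, boundary='symm', mode='valid')
# a_core corresponds to a[2:nx-3, 2:nz-3] and b_core to b[2:nx-3, 2:nz-3]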
Runtime tests
In [532]: %timeit loop_based(nx, nz, c, rho)
1 loops, best of 3: 1.52 s per loop
In [533]: %timeit numba1(nx, nz, c, rho)
1 loops, best of 3: 282 ms per loop
In [534]: %timeit numba2(nx, nz, c, rho)
1 loops, best of 3: 509 ms per loop
In [535]: %timeit conv_based(nx, nz, c, rho)
10 loops, best of 3: 15.5 ms per loop
So, for the actual input data size, the proposed convolution-based approach is about 100x faster than the loopy code and about 20x better than the fastest numba-based approach, numba1.