I am trying to optimize the following loop:
def numpy(nx, nz, c, rho):
    for ix in range(2, nx-3):
        for iz in range(2, nz-3):
            a[ix, iz] = sum(c*rho[ix-1:ix+3, iz])
            b[ix, iz] = sum(c*rho[ix-2:ix+2, iz])
    return a, b
I tried different solutions and found that using numba to calculate the sum of the product leads to better performance:
import numpy as np
import numba as nb
import time

@nb.autojit
def sum_opt(arr1, arr2):
    # dot product of two 1-D arrays, written as an explicit loop for numba
    s = arr1[0]*arr2[0]
    for i in range(1, len(arr1)):
        s += arr1[i]*arr2[i]
    return s

def numba1(nx, nz, c, rho):
    # plain Python double loop, only the inner sum is numba-compiled
    for ix in range(2, nx-3):
        for iz in range(2, nz-3):
            a[ix, iz] = sum_opt(c, rho[ix-1:ix+3, iz])
            b[ix, iz] = sum_opt(c, rho[ix-2:ix+2, iz])
    return a, b

@nb.autojit
def numba2(nx, nz, c, rho):
    # same double loop, but the whole function is numba-compiled
    for ix in range(2, nx-3):
        for iz in range(2, nz-3):
            a[ix, iz] = sum_opt(c, rho[ix-1:ix+3, iz])
            b[ix, iz] = sum_opt(c, rho[ix-2:ix+2, iz])
    return a, b

nx = 1024
nz = 256
rho = np.random.rand(nx, nz)
c = np.random.rand(4)
a = np.zeros((nx, nz))
b = np.zeros((nx, nz))

ti = time.clock()
a, b = numpy(nx, nz, c, rho)
print 'Time numpy : ' + str(round(time.clock() - ti, 4))

ti = time.clock()
a, b = numba1(nx, nz, c, rho)
print 'Time numba1 : ' + str(round(time.clock() - ti, 4))

ti = time.clock()
a, b = numba2(nx, nz, c, rho)
print 'Time numba2 : ' + str(round(time.clock() - ti, 4))
This leads to:
Time numpy : 4.1595
Time numba1 : 0.6993
Time numba2 : 1.0135
Using the numba version of the sum function (sum_opt) performs very well. But I am wondering why the numba version of the double-loop function (numba2) leads to slower execution times. I tried using jit instead of autojit and specifying the argument types, but it was worse.
I also noticed that looping first over the smallest range is slower than looping first over the biggest range. Is there any explanation?
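For reference, here is a minimal sketch of the swapped ordering I mean (same body as numba2, only the loop order changes; numba2_swapped is just an illustrative name):
@nb.autojit
def numba2_swapped(nx, nz, c, rho):
    for iz in range(2, nz-3):        # smallest range outermost
        for ix in range(2, nx-3):    # largest range innermost
            a[ix, iz] = sum_opt(c, rho[ix-1:ix+3, iz])
            b[ix, iz] = sum_opt(c, rho[ix-2:ix+2, iz])
    return a, b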
Either way, I am sure this double-loop function can be improved a lot by vectorizing the problem (like this) or by using another method (map?), but I am a little confused about these methods.
In the other parts of my code, I used numba and numpy slicing methods to replace all explicit loops, but in this particular case I don't know how to set it up.
Any ideas?
EDIT
Thanks for all your comments. I worked a little on this problem:
import numba as nb
import numpy as np
from scipy import signal
import time

@nb.jit(['float64(float64[:], float64[:])'], nopython=True)
def sum_opt(arr1, arr2):
    s = arr1[0]*arr2[0]
    for i in xrange(1, len(arr1)):
        s += arr1[i]*arr2[i]
    return s

@nb.autojit
def numba1(nx, nz, c, rho, a, b):
    for ix in range(2, nx-3):
        for iz in range(2, nz-3):
            a[ix, iz] = sum_opt(c, rho[ix-1:ix+3, iz])
            b[ix, iz] = sum_opt(c, rho[ix-2:ix+2, iz])
    return a, b

@nb.jit(nopython=True)
def numba2(nx, nz, c, rho, a, b):
    for ix in range(2, nx-3):
        for iz in range(2, nz-3):
            a[ix, iz] = sum_opt(c, rho[ix-1:ix+3, iz])
            b[ix, iz] = sum_opt(c, rho[ix-2:ix+2, iz])
    return a, b

@nb.jit(['float64[:,:](int16, int16, float64[:], float64[:,:], float64[:,:])'], nopython=True)
def numba3a(nx, nz, c, rho, a):
    for ix in range(2, nx-3):
        for iz in range(2, nz-3):
            a[ix, iz] = sum_opt(c, rho[ix-1:ix+3, iz])
    return a

@nb.jit(['float64[:,:](int16, int16, float64[:], float64[:,:], float64[:,:])'], nopython=True)
def numba3b(nx, nz, c, rho, b):
    for ix in range(2, nx-3):
        for iz in range(2, nz-3):
            b[ix, iz] = sum_opt(c, rho[ix-2:ix+2, iz])
    return b

def convol(nx, nz, c, aa, bb):
    # convolution-based version (rho is taken from the enclosing scope)
    s1 = rho[1:nx-1, 2:nz-3]
    s2 = rho[0:nx-2, 2:nz-3]
    kernel = c[:, None][::-1]
    aa[2:nx-3, 2:nz-3] = signal.convolve2d(s1, kernel, boundary='symm', mode='valid')
    bb[2:nx-3, 2:nz-3] = signal.convolve2d(s2, kernel, boundary='symm', mode='valid')
    return aa, bb

nx = 1024
nz = 256
rho = np.random.rand(nx, nz)
c = np.random.rand(4)
a = np.zeros((nx, nz))
b = np.zeros((nx, nz))

ti = time.clock()
for i in range(1000):
    a, b = numba1(nx, nz, c, rho, a, b)
print 'Time numba1 : ' + str(round(time.clock() - ti, 4))

ti = time.clock()
for i in range(1000):
    a, b = numba2(nx, nz, c, rho, a, b)
print 'Time numba2 : ' + str(round(time.clock() - ti, 4))

ti = time.clock()
for i in range(1000):
    a = numba3a(nx, nz, c, rho, a)
    b = numba3b(nx, nz, c, rho, b)
print 'Time numba3 : ' + str(round(time.clock() - ti, 4))

ti = time.clock()
for i in range(1000):
    a, b = convol(nx, nz, c, a, b)
print 'Time convol : ' + str(round(time.clock() - ti, 4))
Your solution is very elegant, Divakar, but I have to use this function a large number of times in my code. So, for 1000 iterations, this leads to:
Time numba1 : 3.2487
Time numba2 : 3.7012
Time numba3 : 3.2088
Time convol : 22.7696
autojit and jit are very close. However, when using jit, it seems important to specify all argument types. I do not know if there is a way to specify argument types in the jit decorator when the function has multiple outputs. Does anyone know?
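For what it's worth, here is a minimal sketch of how a tuple return type might be spelled out in a jit signature using numba.types.UniTuple; I have not verified this against the numba version used above, so treat the syntax as an assumption (numba2_sig is just an illustrative name):
import numba as nb

# Assumed syntax: the return type is a homogeneous tuple of two 2-D float64 arrays
sig = nb.types.UniTuple(nb.float64[:, :], 2)(
    nb.int64, nb.int64, nb.float64[:], nb.float64[:, :],
    nb.float64[:, :], nb.float64[:, :])

@nb.jit([sig], nopython=True)
def numba2_sig(nx, nz, c, rho, a, b):
    for ix in range(2, nx-3):
        for iz in range(2, nz-3):
            a[ix, iz] = sum_opt(c, rho[ix-1:ix+3, iz])
            b[ix, iz] = sum_opt(c, rho[ix-2:ix+2, iz])
    return a, b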
For now, I have not found any other solution than using numba. New ideas are welcome!
You are basically performing 2D convolution there, with the small modification that your kernel is not reversed, as it would be in the usual convolution operation.
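To see that connection concretely, here is a small self-contained check (the names col and ix are just for illustration, not from the original post) showing that the per-point sum in the loop equals one sample of a 1-D convolution with the reversed kernel:
import numpy as np

c = np.random.rand(4)
col = np.random.rand(10)   # stands in for one column of rho
ix = 3

loop_value = np.sum(c * col[ix-1:ix+3])                    # as in the loopy code
conv_value = np.convolve(col, c[::-1], mode='full')[ix+2]  # reversed kernel
print np.isclose(loop_value, conv_value)                   # True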
So, basically, there are two things we need to do here to use signal.convolve2d to solve our case:
1. Slice rho to select the portion of it that is used in the original loopy version of your code. This will be the input data to the convolution.
2. Reverse the kernel c and feed it, along with the sliced data, to signal.convolve2d.
Please note that these are to be done for the calculation of a and b separately.
Here's the implementation -
import numpy as np
from scipy import signal
# Slices for convolutions to get a and b respectively
s1 = rho[1:nx-1,2:nz-3]
s2 = rho[0:nx-2,2:nz-3]
kernel = c[:,None][::-1] # convolution kernel
# Setup output arrays and fill them with convolution results
a = np.zeros((nx, nz))
b = np.zeros((nx, nz))
a[2:nx-3,2:nz-3] = signal.convolve2d(s1, kernel, boundary='symm', mode='valid')
b[2:nx-3,2:nz-3] = signal.convolve2d(s2, kernel, boundary='symm', mode='valid')
If you don't need the extra zeros around the boundaries of the output arrays, you could simply use the outputs from signal.convolve2d as they are, which should further boost performance.
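As a minimal sketch of that variant (reusing s1, s2 and kernel from above; the _core names are just illustrative):
# Keep only the 'valid' convolution results, without padding them
# back into full-size zero arrays.
a_core = signal.convolve2d(s1, kernel, boundary='symm', mode='valid')
b_core = signal.convolve2d(s2, kernel, boundary='symm', mode='valid')
# a_core corresponds to a[2:nx-3, 2:nz-3] and b_core to b[2:nx-3, 2:nz-3]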
Runtime tests
In [532]: %timeit loop_based(nx, nz, c, rho)
1 loops, best of 3: 1.52 s per loop
In [533]: %timeit numba1(nx, nz, c, rho)
1 loops, best of 3: 282 ms per loop
In [534]: %timeit numba2(nx, nz, c, rho)
1 loops, best of 3: 509 ms per loop
In [535]: %timeit conv_based(nx, nz, c, rho)
10 loops, best of 3: 15.5 ms per loop
So, for the actual input data size, the proposed convolution-based approach is about 100x faster than the loopy code and about 20x better than the fastest numba-based approach, numba1.