Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

A long term puzzle, how to optimize multi-level loops in python?

Tags:

python

numpy

I have written a function in python to calculate Delta function in Gauss broadening, which involves 4-level loops. However, the efficiency is very low, about 10 times slower than using Fortran in a similar way.

def Delta_Gaussf(Nw, N_bd, N_kp, hw, eigv):
    Delta_Gauss = np.zeros((Nw,N_kp,N_bd,N_bd),dtype=float)
    for w1 in range(Nw):
        for k1 in range(N_kp):
            for i1 in range(N_bd):
                for j1 in range(N_bd):
                    if ( j1 >= i1 ):
                        Delta_Gauss[w1][k1][i1][j1] = np.exp(pow((eigv[k1][j1]-eigv[k1][i1]-hw[w1])/width,2))
    return Delta_Gauss

I have removed some constants to make it looks simpler.

Could any one help me to optimize this script to increase efficiency?

like image 401
Memories Avatar asked May 08 '26 04:05

Memories


1 Answers

Simply compile it

To get the best performance I recommend Numba (easy usage, good performance). Alternatively Cython may be a good idea, but with a bit more changes to your code.

You actually got everything right and implemented a easy to understand (for a human and most important for a compiler) solution.

There are basically two ways to gain performance

  1. Vectorize the code as @scnerd showed. This is usually a bit slower and more complex than simply compile a quite simple code, that only uses some for loops. Don't vectorize your code and than use a compiler. From a simple looping aproach this is usually some work to do and leads to a slower and more complex result. The advantage of this process is that you only need numpy, which is a standard dependency in nearly every Python project that deals with some numerical calculations.

  2. Compile the code. If you have already a solution with a few loops and no other, or only a few non numpy functions involved this is often the simplest and fastest solution.

A solution using Numba

You do not have to change much, I changed the pow function to np.power and some slight changes to the way arrays accessed in numpy (this isn't really necessary).

import numba as nb
import numpy as np

#performance-debug info
import llvmlite.binding as llvm
llvm.set_option('', '--debug-only=loop-vectorize')

@nb.njit(fastmath=True)
def Delta_Gaussf_nb(Nw, N_bd, N_kp, hw, width,eigv):
    Delta_Gauss = np.zeros((Nw,N_kp,N_bd,N_bd),dtype=float)
    for w1 in range(Nw):
        for k1 in range(N_kp):
            for i1 in range(N_bd):
                for j1 in range(N_bd):
                    if ( j1 >= i1 ):
                        Delta_Gauss[w1,k1,i1,j1] = np.exp(np.power((eigv[k1,j1]-eigv[k1,i1]-hw[w1])/width,2))
    return Delta_Gauss

Due to the 'if' the SIMD-vectorization fails. In the next step we can remove it (maybe a call outside the njited function to np.triu(Delta_Gauss) will be necessary). I also parallelized the function.

@nb.njit(fastmath=True,parallel=True)
def Delta_Gaussf_1(Nw, N_bd, N_kp, hw, width,eigv):
    Delta_Gauss = np.zeros((Nw,N_kp,N_bd,N_bd),dtype=np.float64)
    for w1 in nb.prange(Nw):
        for k1 in range(N_kp):
            for i1 in range(N_bd):
                for j1 in range(N_bd):
                    Delta_Gauss[w1,k1,i1,j1] = np.exp(np.power((eigv[k1,j1]-eigv[k1,i1]-hw[w1])/width,2))
    return Delta_Gauss

Performance

Nw = 20
N_bd = 20
N_kp = 20
width=20
hw = np.linspace(0., 1.0, Nw) 
eigv = np.zeros((N_kp, N_bd),dtype=np.float) 

Your version:           0.5s
first_compiled version: 1.37ms
parallel version:       0.55ms

These easy optimizations lead to about 1000x speedup.

like image 184
max9111 Avatar answered May 09 '26 19:05

max9111



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!