Optimizing suggestions for a piece of Julia and Python code

Question

I have a program for a simulation and inside the program I have a function. I have realized that the function consumes most time of simulation. So, I am trying to optimize the funcion first. The function is as follows

Julia version 1.1:

function fun_jul(M,ksi,xi,x)

  F(n,x) = sin(n*pi*(x+1)/2)*cos(n*pi*(x+1)/2);
  K = length(ksi);
  Z = zeros(length(x),K);
  for n in 1:M
    for k in 1:K
        for l in 1:length(x)
          Z[l,k] +=  (1-(n/(M+1))^2)^xi*F(n,ksi[k])*F(n,x[l]);
        end
    end
  end

return Z

end

I also rewrite the above function in python+numba for comparison as follows

Python+numba

import numpy as np
from numba import prange, jit    
@jit(nopython=True, parallel=True)
def fun_py(M,ksi,xi,x):

    K = len(ksi);
    F = lambda nn,xx: np.sin(nn*np.pi*(xx+1)/2)*np.cos(nn*np.pi*(xx+1)/2);

    Z = np.zeros((len(x),K));
    for n in range(1,M+1):
        for k in prange(0,K):
            Z[:,k] += (1-(n/(M+1))**2)**xi*F(n,ksi[k])*F(n,x);


    return Z

But Julia codes are very slow here are my results:

Julia results:

using BenchmarkTools
N=400; a=-0.5; b=0.5; x=range(a,b,length=N); cc=x; M = 2*N+100; xi = M/40;
@benchmark fun_jul(M,cc,xi,x)

BenchmarkTools.Trial: 
  memory estimate:  1.22 MiB
  allocs estimate:  2
  --------------
  minimum time:     25.039 s (0.00% GC)
  median time:      25.039 s (0.00% GC)
  mean time:        25.039 s (0.00% GC)
  maximum time:     25.039 s (0.00% GC)
  --------------
  samples:          1
  evals/sample:     1

Python results:

N=400;a = -0.5;b = 0.5;x = np.linspace(a,b,N);cc = x;M = 2*N + 100;xi = M/40;

%timeit fun_py(M,cc,xi,x);
1.2 s ± 10.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Any help on improving the codes both for julia and python+numba would be appreciated.

Updated

Based on @Przemyslaw Szufel's answer and the other posts I have improved numba and julia codes. Now both are parallelized. Here are timings

Python+Numba times:

@jit(nopython=True, parallel=True)
def fun_py(M,ksi,xi,x):

    K = len(ksi);
    F = lambda nn,xx: np.sin(nn*np.pi*(xx+1)/2)*np.cos(nn*np.pi*(xx+1)/2);

Z = np.zeros((K,len(x)));
for n in range(1,M+1):
    pw = (1-(n/(M+1))**2)**xi; f=F(n,x)
    for k in prange(0,K):
        Z[k,:] = Z[k,:] + pw*F(n,ksi[k])*f;


    return Z


N=1000; a=-0.5; b=0.5; x=np.linspace(a,b,N); cc=x; M = 2*N+100; xi = M/40;

%timeit fun_py(M,cc,xi,x);
733 ms ± 13.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Julia times

N=1000; a=-0.5; b=0.5; x=range(a,b,length=N); cc=x; M = 2*N+100; xi = M/40;
@benchmark fun_jul2(M,cc,xi,x)

BenchmarkTools.Trial: 
  memory estimate:  40.31 MiB
  allocs estimate:  6302
  --------------
  minimum time:     705.470 ms (0.17% GC)
  median time:      726.403 ms (0.17% GC)
  mean time:        729.032 ms (1.68% GC)
  maximum time:     765.426 ms (5.27% GC)
  --------------
  samples:          7
  evals/sample:     1

Przemyslaw Szufel · Accepted Answer

I got down to 300ms on a single thread (instead of 28s on my machine) with the following code.

You are using multi-threading for Numba. In Julia you should use parallel processing (multi-threading support is experimental fo Julia). It seems that your code is doing some kind of parameter sweep - such codes are very easy to parallelize but it usually requires some adjustments to your computational process.

Here is the code:

function fun_jul2(M,ksi,xi,x)
  F(n,x) = sin(n*pi*(x+1))/2;
  K = length(ksi);
  L = length(x);
  Z = zeros(length(x),K);
  for n in 1:M
    F_im1= [F(n,ksi[k]) for k in 1:K]
    F_im2 = [F(n,x[l]) for l in 1:L]
    pow = (1-(n/(M+1))^2)^xi    
    for k in 1:K
       for l in 1:L    
          Z[l,k] += pow*F_im1[k]*F_im2[l];
      end
    end
  end
  Z
end

julia> fun_jul2(M,cc,xi,x) ≈ fun_jul(M,cc,xi,x)
true

julia> @time fun_jul2(M,cc,xi,x);
  0.305269 seconds (1.81 k allocations: 6.934 MiB, 1.60% gc time)

** EDIT: with multithreading and inbounds suggested by DNF:

function fun_jul3(M,ksi,xi,x)
  F(n,x) = sin(n*pi*(x+1))/2;
  K = length(ksi);
  L = length(x);
  Z = zeros(length(x),K);
  for n in 1:M
    F_im1= [F(n,ksi[k]) for k in 1:K]
    F_im2 = [F(n,x[l]) for l in 1:L]
    pow = (1-(n/(M+1))^2)^xi    
    Threads.@threads for k in 1:K
       for l in 1:L    
          @inbounds Z[l,k] += pow*F_im1[k]*F_im2[l];
      end
    end
  end
  Z
end

And now the running time (remember to run set JULIA_NUM_THREADS=4 or Linux equivalent before launching Julia):

julia>  fun_jul2(M,cc,xi,x) ≈ fun_jul3(M,cc,xi,x)
true

julia> @time fun_jul3(M,cc,xi,x);
  0.051470 seconds (2.71 k allocations: 6.989 MiB)

You could also try to further experiment with parallelizing of computing of F_im1 and F_im2.

Bill · Answer

You can do, or fail to do, loop optimization in any language that has loops. The major difference here is that the numba code is vectorized for the inner loop but the Julia code is not. To vectorize the Julia version, it is sometimes necessary to change operators to their vectorized versions with the ., so that + becomes .+ for example.

Since I cannot get Numba to install properly on my older Windows 10 machine, I ran the code versions below on free Linux versions on the Web. This means I had to use the Python interface for timeit(), not the command line.

Run in Jupyter at mybinder, probably with 1 thread since it is not specified. :

import timeit
timeit.timeit("""
@jit(nopython=True, parallel=True)
def fun_py(M,ksi,xi,x):

    K = len(ksi);
    F = lambda nn,xx: np.sin(nn*np.pi*(xx+1)/2)*np.cos(nn*np.pi*(xx+1)/2);

    Z = np.zeros((len(x),K));
    for n in range(1,M+1):
        for k in prange(0,K):
            Z[:,k] += (1-(n/(M+1))**2)**xi*F(n,ksi[k])*F(n,x);


    return Z


N=400; a = -0.5; b = 0.5; x = np.linspace(a,b,N); cc = x;M = 2*N + 100; xi = M/40;
fun_py(M,cc,xi,x)
""", setup ="import numpy as np; from numba import prange, jit", number=5)

Out[1]: 61.07768889795989

Your machine must be a lot faster than Jupyter, ForBonder.

I ran this optimized julia code version below, in Jupyter on JuliaBox, 1 thread kernel specified:

using BenchmarkTools

F(n, x) = sinpi.(n * (x .+ 1) / 2) .* cospi.(n * (x .+ 1) / 2)

function fun_jul2(M, ksi, xi, x)
    K = length(ksi)
    Z = zeros(length(x), K)
    for n in 1:M, k in 1:K
        Z[:, k] .+= (1 - (n / (M + 1))^2)^xi * F(n, ksi[k]) * F(n, x)
    end
    return Z
end


const N=400; const a=-0.5; const b=0.5; const x=range(a,b,length=N); 
const cc=x; const M = 2*N+100; const xi = M/40;

@btime fun_jul2(M, cc, xi, x)

8.076 s (1080002 allocations: 3.35 GiB)

B. M. · Answer

For performance, just precompute the trigonometric part. Indeed, sin is a costly operation:

%timeit np.sin(1.)
712 ns ± 2.22 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

%timeit 1.2*3.4
5.88 ns ± 0.016 ns per loop (mean ± std. dev. of 7 runs, 100000000 loops each)

In python :

@jit
def fun_py2(M,ksi,xi,x):

    NN = np.arange(1,M+1)
    Fksi = np.sin(np.pi*np.outer(NN,ksi+1))/2  # sin(a)cos(a) is sin(2a)/2
    Fx = np.sin(np.pi*np.outer(NN,x+1))/2
    U = (1-(NN/(M+1))**2)**xi
    Z = np.zeros((len(x),len(ksi)))

    for n in range(len(NN)):
        for k in range(len(ksi)):
            for l in range(len(x)):
                Z[k,l] += U[n] * Fksi[n,k] * Fx[n,l];
    return Z

For a 30x improvement:

np.allclose(fun_py(M,cc,xi,x),fun_py2(M,cc,xi,x))
True

%timeit fun_py(M,cc,xi,x)
1.14 s ± 4.47 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit fun_py2(M,cc,xi,x)
29.5 ms ± 375 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

This doesn't trig any parallelism. I suppose the same will occur for Julia.

Optimizing suggestions for a piece of Julia and Python code

Tags:

performance

python

optimization

julia

numba

For Bonder

Video Answer

3 Answers

Przemyslaw Szufel

Bill

B. M.

Recent Activity

Donate For Us

Optimizing suggestions for a piece of Julia and Python code

Tags:

performance

python

optimization

julia

numba

For Bonder

Video Answer

3 Answers

Przemyslaw Szufel

Bill

B. M.

Related questions

Recent Activity

Donate For Us