I have a program for a simulation and inside the program I have a function. I have realized that the function consumes most time of simulation. So, I am trying to optimize the funcion first. The function is as follows
Julia version 1.1:
function fun_jul(M,ksi,xi,x)
F(n,x) = sin(n*pi*(x+1)/2)*cos(n*pi*(x+1)/2);
K = length(ksi);
Z = zeros(length(x),K);
for n in 1:M
for k in 1:K
for l in 1:length(x)
Z[l,k] += (1-(n/(M+1))^2)^xi*F(n,ksi[k])*F(n,x[l]);
end
end
end
return Z
end
I also rewrite the above function in python+numba for comparison as follows
Python+numba
import numpy as np
from numba import prange, jit
@jit(nopython=True, parallel=True)
def fun_py(M,ksi,xi,x):
K = len(ksi);
F = lambda nn,xx: np.sin(nn*np.pi*(xx+1)/2)*np.cos(nn*np.pi*(xx+1)/2);
Z = np.zeros((len(x),K));
for n in range(1,M+1):
for k in prange(0,K):
Z[:,k] += (1-(n/(M+1))**2)**xi*F(n,ksi[k])*F(n,x);
return Z
But Julia codes are very slow here are my results:
Julia results:
using BenchmarkTools
N=400; a=-0.5; b=0.5; x=range(a,b,length=N); cc=x; M = 2*N+100; xi = M/40;
@benchmark fun_jul(M,cc,xi,x)
BenchmarkTools.Trial:
memory estimate: 1.22 MiB
allocs estimate: 2
--------------
minimum time: 25.039 s (0.00% GC)
median time: 25.039 s (0.00% GC)
mean time: 25.039 s (0.00% GC)
maximum time: 25.039 s (0.00% GC)
--------------
samples: 1
evals/sample: 1
Python results:
N=400;a = -0.5;b = 0.5;x = np.linspace(a,b,N);cc = x;M = 2*N + 100;xi = M/40;
%timeit fun_py(M,cc,xi,x);
1.2 s ± 10.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Any help on improving the codes both for julia and python+numba would be appreciated.
Updated
Based on @Przemyslaw Szufel's answer and the other posts I have improved numba and julia codes. Now both are parallelized. Here are timings
Python+Numba times:
@jit(nopython=True, parallel=True)
def fun_py(M,ksi,xi,x):
K = len(ksi);
F = lambda nn,xx: np.sin(nn*np.pi*(xx+1)/2)*np.cos(nn*np.pi*(xx+1)/2);
Z = np.zeros((K,len(x)));
for n in range(1,M+1):
pw = (1-(n/(M+1))**2)**xi; f=F(n,x)
for k in prange(0,K):
Z[k,:] = Z[k,:] + pw*F(n,ksi[k])*f;
return Z
N=1000; a=-0.5; b=0.5; x=np.linspace(a,b,N); cc=x; M = 2*N+100; xi = M/40;
%timeit fun_py(M,cc,xi,x);
733 ms ± 13.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Julia times
N=1000; a=-0.5; b=0.5; x=range(a,b,length=N); cc=x; M = 2*N+100; xi = M/40;
@benchmark fun_jul2(M,cc,xi,x)
BenchmarkTools.Trial:
memory estimate: 40.31 MiB
allocs estimate: 6302
--------------
minimum time: 705.470 ms (0.17% GC)
median time: 726.403 ms (0.17% GC)
mean time: 729.032 ms (1.68% GC)
maximum time: 765.426 ms (5.27% GC)
--------------
samples: 7
evals/sample: 1
I got down to 300ms on a single thread (instead of 28s on my machine) with the following code.
You are using multi-threading for Numba. In Julia you should use parallel processing (multi-threading support is experimental fo Julia). It seems that your code is doing some kind of parameter sweep - such codes are very easy to parallelize but it usually requires some adjustments to your computational process.
Here is the code:
function fun_jul2(M,ksi,xi,x)
F(n,x) = sin(n*pi*(x+1))/2;
K = length(ksi);
L = length(x);
Z = zeros(length(x),K);
for n in 1:M
F_im1= [F(n,ksi[k]) for k in 1:K]
F_im2 = [F(n,x[l]) for l in 1:L]
pow = (1-(n/(M+1))^2)^xi
for k in 1:K
for l in 1:L
Z[l,k] += pow*F_im1[k]*F_im2[l];
end
end
end
Z
end
julia> fun_jul2(M,cc,xi,x) ≈ fun_jul(M,cc,xi,x)
true
julia> @time fun_jul2(M,cc,xi,x);
0.305269 seconds (1.81 k allocations: 6.934 MiB, 1.60% gc time)
** EDIT: with multithreading and inbounds suggested by DNF:
function fun_jul3(M,ksi,xi,x)
F(n,x) = sin(n*pi*(x+1))/2;
K = length(ksi);
L = length(x);
Z = zeros(length(x),K);
for n in 1:M
F_im1= [F(n,ksi[k]) for k in 1:K]
F_im2 = [F(n,x[l]) for l in 1:L]
pow = (1-(n/(M+1))^2)^xi
Threads.@threads for k in 1:K
for l in 1:L
@inbounds Z[l,k] += pow*F_im1[k]*F_im2[l];
end
end
end
Z
end
And now the running time (remember to run set JULIA_NUM_THREADS=4
or Linux equivalent before launching Julia):
julia> fun_jul2(M,cc,xi,x) ≈ fun_jul3(M,cc,xi,x)
true
julia> @time fun_jul3(M,cc,xi,x);
0.051470 seconds (2.71 k allocations: 6.989 MiB)
You could also try to further experiment with parallelizing of computing of F_im1
and F_im2
.
You can do, or fail to do, loop optimization in any language that has loops. The major difference here is that the numba code is vectorized for the inner loop but the Julia code is not. To vectorize the Julia version, it is sometimes necessary to change operators to their vectorized versions with the ., so that + becomes .+ for example.
Since I cannot get Numba to install properly on my older Windows 10 machine, I ran the code versions below on free Linux versions on the Web. This means I had to use the Python interface for timeit(), not the command line.
Run in Jupyter at mybinder, probably with 1 thread since it is not specified. :
import timeit
timeit.timeit("""
@jit(nopython=True, parallel=True)
def fun_py(M,ksi,xi,x):
K = len(ksi);
F = lambda nn,xx: np.sin(nn*np.pi*(xx+1)/2)*np.cos(nn*np.pi*(xx+1)/2);
Z = np.zeros((len(x),K));
for n in range(1,M+1):
for k in prange(0,K):
Z[:,k] += (1-(n/(M+1))**2)**xi*F(n,ksi[k])*F(n,x);
return Z
N=400; a = -0.5; b = 0.5; x = np.linspace(a,b,N); cc = x;M = 2*N + 100; xi = M/40;
fun_py(M,cc,xi,x)
""", setup ="import numpy as np; from numba import prange, jit", number=5)
Out[1]: 61.07768889795989
Your machine must be a lot faster than Jupyter, ForBonder.
I ran this optimized julia code version below, in Jupyter on JuliaBox, 1 thread kernel specified:
using BenchmarkTools
F(n, x) = sinpi.(n * (x .+ 1) / 2) .* cospi.(n * (x .+ 1) / 2)
function fun_jul2(M, ksi, xi, x)
K = length(ksi)
Z = zeros(length(x), K)
for n in 1:M, k in 1:K
Z[:, k] .+= (1 - (n / (M + 1))^2)^xi * F(n, ksi[k]) * F(n, x)
end
return Z
end
const N=400; const a=-0.5; const b=0.5; const x=range(a,b,length=N);
const cc=x; const M = 2*N+100; const xi = M/40;
@btime fun_jul2(M, cc, xi, x)
8.076 s (1080002 allocations: 3.35 GiB)
For performance, just precompute the trigonometric part.
Indeed, sin
is a costly operation:
%timeit np.sin(1.)
712 ns ± 2.22 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit 1.2*3.4
5.88 ns ± 0.016 ns per loop (mean ± std. dev. of 7 runs, 100000000 loops each)
In python :
@jit
def fun_py2(M,ksi,xi,x):
NN = np.arange(1,M+1)
Fksi = np.sin(np.pi*np.outer(NN,ksi+1))/2 # sin(a)cos(a) is sin(2a)/2
Fx = np.sin(np.pi*np.outer(NN,x+1))/2
U = (1-(NN/(M+1))**2)**xi
Z = np.zeros((len(x),len(ksi)))
for n in range(len(NN)):
for k in range(len(ksi)):
for l in range(len(x)):
Z[k,l] += U[n] * Fksi[n,k] * Fx[n,l];
return Z
For a 30x improvement:
np.allclose(fun_py(M,cc,xi,x),fun_py2(M,cc,xi,x))
True
%timeit fun_py(M,cc,xi,x)
1.14 s ± 4.47 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit fun_py2(M,cc,xi,x)
29.5 ms ± 375 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
This doesn't trig any parallelism. I suppose the same will occur for Julia.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With