I just started with Julia and translated my MATLAB code into Julia (basically line by line). I noticed that the Julia code is much slower (about 50x). The original problem is a dynamic programming problem in which I interpolate the value function -- that interpolation is where the code spends most of its time. So I tried to make a minimal example that shows the performance difference. The important things to note are that the interpolation uses a spline approximation and that the grid is preferably irregular, i.e. not equally spaced. The MATLAB code:
tic
spacing=1.5;
Nxx = 300;
Naa = 350;
Nalal = 200;
sigma = 10;
NoIter = 500;
xx=NaN(Nxx,1);
xmin = 0.01;
xmax = 400;
xx(1) = xmin;
for i=2:Nxx
    xx(i) = xx(i-1) + (xmax-xx(i-1))/((Nxx-i+1)^spacing);
end
f_U = @(c) c.^(1-sigma)/(1-sigma);
W=NaN(Nxx,1);
W(:,1) = f_U(xx);
xprime = ones(Nalal,Naa);
for i=1:NoIter
    W_temp = interp1(xx,W(:,1),xprime,'spline');
end
toc
Elapsed time is 0.242288 seconds.
The Julia code:
using Dierckx
function performance()
    const spacing = 1.5
    const Nxx = 300
    const Naa = 350
    const Nalal = 200
    const sigma = 10
    const NoIter = 500
    xx = Array(Float64, Nxx)
    xmin = 0.01
    xmax = 400
    xx[1] = xmin
    # irregular grid: points cluster near xmin, spacing grows toward xmax
    for i = 2:Nxx
        xx[i] = xx[i-1] + (xmax - xx[i-1])/((Nxx - i + 1)^spacing)
    end
    f_U(c) = c.^(1 - sigma)/(1 - sigma)    # CRRA utility
    W = Array(Float64, Nxx, 1)
    W[:, 1] = f_U(xx)
    W_temp = Array(Float64, Nalal, Naa)
    interp_spline = Spline1D(xx, W[:, 1])  # cubic spline on the irregular grid
    xprime = ones(Nalal, Naa)
    for i = 1:NoIter
        for j = 1:Naa
            for k = 1:Nalal
                W_temp[k, j] = evaluate(interp_spline, xprime[k, j])
            end
        end
    end
end
@time(performance())
4.200732 seconds (70.02 M allocations: 5.217 GB, 4.35% gc time)
I made W a 2-d array because in the original problem it is a matrix. I did some research on different interpolation packages, but there weren't many options for irregular grids and splines. MATLAB's interp1 is apparently not available.
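For reference, one alternative I came across is the Interpolations.jl package, which does accept irregular grids via its Gridded interface, although (as far as I can tell) only up to linear order, not cubic splines. A minimal sketch, assuming a recent version of the package where interpolants are called like functions (I have not benchmarked this):

using Interpolations

# Gridded(Linear()) accepts a non-uniform knot vector such as xx;
# note this is linear interpolation, not a spline.
itp = interpolate((xx,), W[:, 1], Gridded(Linear()))
W_lin = [itp(v) for v in xprime]   # evaluates at each query point, same shape as xprime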
My problem is this: I was thinking that if I write Julia code and it gives the same result as MATLAB, then Julia should generally be faster. But apparently that is not the case, so you need to pay some attention to your coding. I am not a programmer; of course I try to do my best, but I would like to know whether I am making some obvious mistake here that can easily be fixed, or whether it will happen (too) often that I have to pay close attention to my Julia coding -- because then it might not be worth it for me to learn it. In the same vein, if I can make Julia faster here (which I am pretty sure I can, e.g. the allocations look a bit large), I could probably also make MATLAB faster. My hope with Julia was that -- for similar code -- it would run faster than MATLAB.
After some comments about the timing, I also ran this code:
using Dierckx
tic()
const spacing = 1.5
const Nxx = 300
const Naa = 350
const Nalal = 200
const sigma = 10
const NoIter = 500
xx = Array(Float64, Nxx)
xmin = 0.01
xmax = 400
xx[1] = xmin
for i = 2:Nxx
    xx[i] = xx[i-1] + (xmax - xx[i-1])/((Nxx - i + 1)^spacing)
end
f_U(c) = c.^(1 - sigma)/(1 - sigma)
W = Array(Float64, Nxx, 1)
W[:, 1] = f_U(xx)
W_temp = Array(Float64, Nalal, Naa)
interp_spline = Spline1D(xx, W[:, 1])
xprime = ones(Nalal, Naa)
for i = 1:NoIter
    for j = 1:Naa
        for k = 1:Nalal
            W_temp[k, j] = evaluate(interp_spline, xprime[k, j])
        end
    end
end
toc()
elapsed time:
7.336371495 seconds
Even much slower, hmm... presumably because everything now runs in global scope, where Julia cannot infer the types of the variables, so the loops are not compiled to fast code.
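For what it's worth, the usual advice seems to be to put performance-critical code inside a function, because Julia can only infer concrete types, and thus compile fast code, for local variables, not for globals. A tiny made-up example of the effect (not part of my original code):

function sum_local(n)
    s = 0.0
    for i = 1:n
        s += i             # s and i have known concrete types here
    end
    return s
end

sum_local(1)               # warm-up call to compile
@time sum_local(10^7)      # fast: the loop runs on inferred types

s = 0.0
@time for i = 1:10^7
    global s += i          # slow: s is an untyped global (the `global` keyword is required on Julia >= 1.0)
end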
Another edit: eliminating one loop actually makes it faster in this case, still not comparable to MATLAB though. The code:
function performance2()
    const spacing = 1.5
    const Nxx = 300
    const Naa = 350
    const Nalal = 200
    const sigma = 10
    const NoIter = 500
    xx = Array(Float64, Nxx)
    xmin = 0.01
    xmax = 400
    xx[1] = xmin
    for i = 2:Nxx
        xx[i] = xx[i-1] + (xmax - xx[i-1])/((Nxx - i + 1)^spacing)
    end
    f_U(c) = c.^(1 - sigma)/(1 - sigma)
    W = Array(Float64, Nxx, 1)
    W[:, 1] = f_U(xx)
    W_temp = Array(Float64, Nalal, Naa)
    interp_spline = Spline1D(xx, W[:, 1])
    xprime = ones(Nalal, Naa)
    for i = 1:NoIter
        for j = 1:Naa
            # one vectorized call per column instead of one call per point
            W_temp[:, j] = evaluate(interp_spline, xprime[:, j])
        end
    end
end
@time(performance2())
1.573347 seconds (700.04 k allocations: 620.643 MB, 1.08% gc time)
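The gain presumably comes from crossing the Julia-to-Fortran boundary once per column instead of once per point. A rough top-level sketch to check that per-call overhead (it reuses the interp_spline, Nalal, and Naa names from the code above, so those would have to be defined at top level; this is not a careful benchmark):

xq = ones(Nalal * Naa)
t_scalar = @elapsed for v in xq
    evaluate(interp_spline, v)                    # one call into Fortran per point
end
t_vector = @elapsed evaluate(interp_spline, xq)   # one call for all points
println("scalar: ", t_scalar, " s, vector: ", t_vector, " s")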
Another edit, now iterating the same number of times through the loop:
function performance3()
    const spacing = 1.5
    const Nxx = 300
    const Naa = 350
    const Nalal = 200
    const sigma = 10
    const NoIter = 500
    xx = Array(Float64, Nxx)
    xmin = 0.01
    xmax = 400
    xx[1] = xmin
    for i = 2:Nxx
        xx[i] = xx[i-1] + (xmax - xx[i-1])/((Nxx - i + 1)^spacing)
    end
    f_U(c) = c.^(1 - sigma)/(1 - sigma)
    W = Array(Float64, Nxx, 1)
    W[:, 1] = f_U(xx)
    W_temp = Array(Float64, Nalal, Naa)
    W_tempr = Array(Float64, Nalal*Naa)
    interp_spline = Spline1D(xx, W[:, 1])
    xprime = ones(Nalal, Naa)
    xprimer = reshape(xprime, Nalal*Naa)   # flatten once, outside the loop
    for i = 1:NoIter
        # a single vectorized call per iteration over all Nalal*Naa points
        W_tempr = evaluate(interp_spline, xprimer)
    end
    W_temp = reshape(W_tempr, Nalal, Naa)
end
tic()
performance3()
toc()
elapsed time:
1.480349334 seconds
Still not at all comparable with MATLAB, though. By the way, in my actual problem I run the loop easily 50k times, and I access bigger xprime matrices, although I am not sure whether that part makes a difference.
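One small note on performance3: the reshape calls are essentially free, because reshape shares memory with the parent array rather than copying it. A quick sanity check:

A = ones(2, 3)
v = reshape(A, 6)
v[1] = 99.0
A[1, 1]   # 99.0 -- the reshaped vector and the matrix share the same data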
Because I'm also learning Julia, I tried some possible speed-ups of the OP's code (for my own practice!). It seems that the bottleneck is essentially the underlying Fortran code. To verify this, I first rewrote the OP's code in a minimal form as follows:
using Dierckx
function perf()
    Nx = 300
    xinp = Float64[ 2pi * i / Nx for i = 1:Nx ]
    yinp = sin( xinp )
    interp = Spline1D( xinp, yinp )
    Nsample = 200 * 350
    x = ones( Nsample ) * pi / 6
    y = zeros( Nsample )
    for i = 1:500
        y[:] = evaluate( interp, x )
    end
    @show y[1:3]   # the result should be 0.5 (= sin(pi/6))
end
@time perf()
@time perf()
@time perf()
where the size of the problem is kept the same, while the input x & y coordinates were changed so that the result is trivially known (0.5). On my machine, the result is
y[1:3] = [0.49999999999999994,0.49999999999999994,0.49999999999999994]
1.886956 seconds (174.20 k allocations: 277.414 MB, 3.55% gc time)
y[1:3] = [0.49999999999999994,0.49999999999999994,0.49999999999999994]
1.659290 seconds (1.56 k allocations: 269.295 MB, 0.39% gc time)
y[1:3] = [0.49999999999999994,0.49999999999999994,0.49999999999999994]
1.648145 seconds (1.56 k allocations: 269.295 MB, 0.28% gc time)
From now on, I will omit y[1:3] for brevity (I have confirmed that in all cases the obtained y[1:3] is correct). If we replace evaluate() by copy!(y, x), the result becomes
0.206723 seconds (168.26 k allocations: 10.137 MB, 10.27% gc time)
0.023068 seconds (60 allocations: 2.198 MB)
0.023013 seconds (60 allocations: 2.198 MB)
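Another way to confirm where the time goes is Julia's built-in sampling profiler (a minimal sketch; on newer Julia versions you first need a "using Profile" line):

@profile perf()
Profile.print()   # the hottest frames should point into evaluate() and the Fortran splev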
so essentially all the time is spent in evaluate(). Now, looking at the original code of this routine, we see that it calls splev() in Fortran, which in turn calls fpbspl() (both originally from Netlib). Those routines are rather old (written around 1990) and seem not very well optimized for current computers (for example, there are lots of IF branches, so vectorization may be difficult). Because it is not trivial how to "vectorize" the code, I instead tried parallelization with OpenMP. The modified splev() looks like this, where the input points are simply divided among threads:
      subroutine splev(t,n,c,k,x,y,m,e,ier)
c  subroutine splev evaluates in a number of points x(i),i=1,2,...,m
c  a spline s(x) of degree k, given in its b-spline representation.
      (... same here ...)
c  main loop for the different points.
c$omp parallel do default(shared)
c$omp.firstprivate(arg,ier,l1,l,ll,sp,h) private(i,j)
      do i = 1, m
c  fetch a new x-value arg.
        arg = x(i)
c  check if arg is in the support
        if (arg .lt. tb .or. arg .gt. te) then
          if (e .eq. 0) then
            goto 35
          else if (e .eq. 1) then
            y(i) = 0
            goto 80
          else if (e .eq. 2) then
            ier = 1
!           goto 100   !! I skipped this error handling for simplicity.
          else if (e .eq. 3) then
            if (arg .lt. tb) then
              arg = tb
            else
              arg = te
            endif
          endif
        endif
c  search for knot interval t(l) <= arg < t(l+1)
 35     if ( t(l) <= arg .or. l1 == k2 ) go to 40
        l1 = l
        l = l - 1
        go to 35
 40     if ( arg < t(l1) .or. l == nk1 ) go to 50
        l = l1
        l1 = l + 1
        go to 40
c  evaluate the non-zero b-splines at arg.
 50     call fpbspl(t, n, k, arg, l, h)
c  find the value of s(x) at x=arg.
        sp = 0.0d0
        ll = l - k1
        do 60 j = 1, k1
          ll = ll + 1
          sp = sp + c(ll)*h(j)
 60     continue
        y(i) = sp
 80     continue
      enddo
c$omp end parallel do
 100  return
      end
Now rebuilding the package with gfortran -fopenmp and calling perf() above gives
$ OMP_NUM_THREADS=1 julia interp.jl
1.911112 seconds (174.20 k allocations: 277.414 MB, 3.49% gc time)
1.599154 seconds (1.56 k allocations: 269.295 MB, 0.41% gc time)
1.671308 seconds (1.56 k allocations: 269.295 MB, 0.28% gc time)
$ OMP_NUM_THREADS=2 julia interp.jl
1.308713 seconds (174.20 k allocations: 277.414 MB, 5.14% gc time)
0.874616 seconds (1.56 k allocations: 269.295 MB, 0.46% gc time)
0.897505 seconds (1.56 k allocations: 269.295 MB, 0.21% gc time)
$ OMP_NUM_THREADS=4 julia interp.jl
0.749203 seconds (174.20 k allocations: 277.414 MB, 9.31% gc time)
0.446702 seconds (1.56 k allocations: 269.295 MB, 0.93% gc time)
0.439522 seconds (1.56 k allocations: 269.295 MB, 0.43% gc time)
$ OMP_NUM_THREADS=8 julia interp.jl
0.478504 seconds (174.20 k allocations: 277.414 MB, 14.66% gc time)
0.243258 seconds (1.56 k allocations: 269.295 MB, 1.81% gc time)
0.233157 seconds (1.56 k allocations: 269.295 MB, 0.89% gc time)
$ OMP_NUM_THREADS=16 julia interp.jl
0.379243 seconds (174.20 k allocations: 277.414 MB, 19.02% gc time)
0.129145 seconds (1.56 k allocations: 269.295 MB, 3.49% gc time)
0.124497 seconds (1.56 k allocations: 269.295 MB, 1.80% gc time)
# Julia: v0.4.0, Machine: Linux x86_64 (Xeon 2.60GHz, 16 cores)
so the scaling seems trivially good (but please let me know if I am making a big mistake in using OpenMP this way...). If the above result is correct, it means that 8 threads are necessary with this Fortran code to match the speed of interp1() on the OP's machine. But the good news is that there is probably room for improvement in the Fortran code (even for serial runs). Anyway, the OP's program (in its final form) seems to be comparing the performance of the underlying interpolation routines, i.e., interp1() in MATLAB vs splev() in Fortran.
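For completeness: instead of patching the Fortran, a similar point-wise parallelization could be attempted on the Julia side with the threading support added in later versions. This is only a sketch, under the assumption (which I have not verified) that the scalar evaluate() method is thread-safe:

# requires Julia >= 0.5 started with JULIA_NUM_THREADS > 1;
# assumes evaluate(spl, x::Real) keeps no shared mutable state (unverified!)
function eval_threaded!(y::Vector{Float64}, spl::Spline1D, x::Vector{Float64})
    Threads.@threads for i = 1:length(x)
        y[i] = evaluate(spl, x[i])
    end
    return y
end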