
Julia vs MATLAB: Why is my Julia code so slow?

I just started with Julia and translated my MATLAB code into Julia (basically line by line). I noticed that the Julia code is much slower (about 50x). The original problem is a dynamic programming problem in which I interpolate the value function -- that interpolation is where the code spends most of its time. So I tried to make a minimal example showing the performance difference. Important things to note are that it is a spline approximation for the interpolation and that the grid is preferably irregular, i.e. not equally spaced. The MATLAB code:

tic
spacing=1.5;
Nxx = 300; 
Naa = 350;
Nalal = 200; 
sigma = 10;
NoIter = 500;

xx=NaN(Nxx,1);
xmin = 0.01;
xmax = 400;
xx(1) = xmin;
for i=2:Nxx
    xx(i) = xx(i-1) + (xmax-xx(i-1))/((Nxx-i+1)^spacing);
end

f_U = @(c) c.^(1-sigma)/(1-sigma);  
W=NaN(Nxx,1);
W(:,1) = f_U(xx);

xprime = ones(Nalal,Naa);
for i=1:NoIter
     W_temp = interp1(xx,W(:,1),xprime,'spline');
end
toc

Elapsed time is 0.242288 seconds.

The Julia code:

using Dierckx
function performance()

const spacing=1.5 
const Nxx = 300 
const Naa = 350
const Nalal = 200 
const sigma = 10 
const NoIter = 500 

xx=Array(Float64,Nxx)
xmin = 0.01
xmax = 400
xx[1] = xmin
for i=2:Nxx
    xx[i] = xx[i-1] + (xmax-xx[i-1])/((Nxx-i+1)^spacing)
end

f_U(c) =  c.^(1-sigma)/(1-sigma)
W=Array(Float64,Nxx,1)
W[:,1] = f_U(xx)

W_temp = Array(Float64,Nalal,Naa)

interp_spline = Spline1D(xx,W[:,1])
xprime = ones(Nalal,Naa)
for i=1:NoIter
    for j=1:Naa
        for k=1:Nalal
            W_temp[k,j] = evaluate(interp_spline, xprime[k,j])
        end
    end
end

end

@time(performance())
4.200732 seconds (70.02 M allocations: 5.217 GB, 4.35% gc time)

I made W a 2-D array because in the original problem it is a matrix. I did some research on interpolation packages, but there weren't many options for splines on an irregular grid. MATLAB's interp1 is apparently not available.
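
For reference, a minimal sketch of how Dierckx handles an irregular grid (illustrative only, not part of my actual code; k=3, the default, gives a cubic spline comparable to interp1's 'spline'):

using Dierckx
xg = cumsum(rand(10) + 0.1)    # strictly increasing, hence a valid irregular grid
yg = log(xg)                   # a smooth function sampled on that grid
spl = Spline1D(xg, yg, k=3)    # k=3 (cubic) is the default; with the default s=0 it interpolates exactly
evaluate(spl, xg[5])           # scalar evaluation
evaluate(spl, [xg[3], xg[7]])  # vector evaluation: one call into Fortran for all points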

My problem is this: I was thinking that if I write Julia code that gives the same result as MATLAB, then Julia should generally be faster. Apparently that is not the case, so you need to pay some attention to your coding. I am not a programmer; of course I try to do my best, but I would like to know whether I am making some obvious mistake here that can easily be fixed, or whether it will happen (too) often that I have to pay attention to my Julia coding -- because then it might not be worth it for me to learn it. In the same vein, if I can make Julia faster here (which I am pretty sure I can, e.g. the allocation looks a bit large: 70M allocations over 500 x 350 x 200 = 35M evaluate calls is two allocations per call, which suggests each scalar call allocates temporaries), I could probably also make MATLAB faster. My hope with Julia was that, for similar code, it would run faster than MATLAB.

After some comments about the timing, I also ran this code:

using Dierckx

tic()
const spacing=1.5 
const Nxx = 300 
const Naa = 350
const Nalal = 200 
const sigma = 10 
const NoIter = 500 

xx=Array(Float64,Nxx)
xmin = 0.01
xmax = 400
xx[1] = xmin
for i=2:Nxx
     xx[i] = xx[i-1] + (xmax-xx[i-1])/((Nxx-i+1)^spacing)
end

f_U(c) =  c.^(1-sigma)/(1-sigma)
W=Array(Float64,Nxx,1)
W[:,1] = f_U(xx)

W_temp = Array(Float64,Nalal,Naa)

interp_spline = Spline1D(xx,W[:,1])
xprime = ones(Nalal,Naa)
for i=1:NoIter
    for j=1:Naa
        for k=1:Nalal
            W_temp[k,j] = evaluate(interp_spline, xprime[k,j])
        end
    end
end
toc()

elapsed time: 7.336371495 seconds

Even much slower, uhm... (presumably because this version runs in global scope, where the non-const variables are untyped globals and the loops cannot be specialized).

Another edit: eliminating one loop actually makes it faster in this case, though still not comparable to MATLAB (presumably because evaluating a whole column at once makes one call into the underlying Fortran routine per column instead of one per point). The code:

function performance2()

const spacing=1.5 
const Nxx = 300 
const Naa = 350
const Nalal = 200 
const sigma = 10 
const NoIter = 500 

xx=Array(Float64,Nxx)
xmin = 0.01
xmax = 400
xx[1] = xmin
for i=2:Nxx
    xx[i] = xx[i-1] + (xmax-xx[i-1])/((Nxx-i+1)^spacing)
end

f_U(c) =  c.^(1-sigma)/(1-sigma)
W=Array(Float64,Nxx,1)
W[:,1] = f_U(xx)

W_temp = Array(Float64,Nalal,Naa)

interp_spline = Spline1D(xx,W[:,1])
xprime = ones(Nalal,Naa)
for i=1:NoIter
    for j=1:Naa
        W_temp[:,j] = evaluate(interp_spline, xprime[:,j])
    end
end

end

@time(performance2())
1.573347 seconds (700.04 k allocations: 620.643 MB, 1.08% gc time)

Another edit, now evaluating all the points in a single call per iteration (the loop still runs NoIter times):

function performance3()

const spacing=1.5 
const Nxx = 300 
const Naa = 350
const Nalal = 200 
const sigma = 10 
const NoIter = 500 

xx=Array(Float64,Nxx)
xmin = 0.01
xmax = 400
xx[1] = xmin
for i=2:Nxx
    xx[i] = xx[i-1] + (xmax-xx[i-1])/((Nxx-i+1)^spacing)
end

f_U(c) =  c.^(1-sigma)/(1-sigma)
W=Array(Float64,Nxx,1)
W[:,1] = f_U(xx)

W_temp = Array(Float64,Nalal,Naa)
W_tempr = Array(Float64, Nalal*Naa)

interp_spline = Spline1D(xx,W[:,1])
xprime = ones(Nalal,Naa)
xprimer = reshape(xprime, Nalal*Naa)

for i=1:NoIter
        W_tempr = evaluate(interp_spline, xprimer)   # note: rebinds W_tempr to a fresh array each iteration; the preallocation above is not reused
end

W_temp = reshape(W_tempr, Nalal, Naa)
end

tic()
performance3()
toc()

elapsed time: 1.480349334 seconds

Still not at all comparable with MATLAB. By the way, in my actual problem I run the loop easily 50k times and access bigger xprime matrices, although I am not sure whether that part makes a difference.

asked Jan 13 '16 by marky2k
1 Answer

Because I'm also learning Julia, I have tried a possible speed-up of the OP's code (for my own practice!). It seems that the bottleneck is essentially the underlying Fortran code. To verify this, I first rewrote the OP's code in a minimal form as follows:

using Dierckx

function perf()

    Nx = 300 

    xinp = Float64[ 2pi * i / Nx for i = 1:Nx ]
    yinp = sin( xinp )

    interp = Spline1D( xinp, yinp )

    Nsample = 200 * 350

    x = ones( Nsample ) * pi / 6
    y = zeros( Nsample )

    for i = 1:500
        y[:] = evaluate( interp, x )
    end

    @show y[1:3]  # The result should be 0.5 (= sin(pi/6))
end

@time perf()
@time perf()
@time perf()

where the size of the problem is kept the same, while the input x and y coordinates are changed so that the result is trivially known (0.5 = sin(pi/6)). On my machine, the result is

y[1:3] = [0.49999999999999994,0.49999999999999994,0.49999999999999994]
  1.886956 seconds (174.20 k allocations: 277.414 MB, 3.55% gc time)
y[1:3] = [0.49999999999999994,0.49999999999999994,0.49999999999999994]
  1.659290 seconds (1.56 k allocations: 269.295 MB, 0.39% gc time)
y[1:3] = [0.49999999999999994,0.49999999999999994,0.49999999999999994]
  1.648145 seconds (1.56 k allocations: 269.295 MB, 0.28% gc time)

From now on, I will omit y[1:3] for brevity (I have confirmed that in all cases the obtained y[1:3] is correct). If we replace evaluate() by copy!(y,x), the result becomes

  0.206723 seconds (168.26 k allocations: 10.137 MB, 10.27% gc time)
  0.023068 seconds (60 allocations: 2.198 MB)
  0.023013 seconds (60 allocations: 2.198 MB)
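
In code, that change is just the inner statement (a sketch; the rest of perf() is unchanged):

for i = 1:500
    copy!(y, x)   # same loop and arrays as before, but no spline evaluation
end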

So essentially all the time is spent in evaluate(). Now looking at the original code of this routine, we see that it calls splev() in Fortran, which in turn calls fpbspl() (both originally from Netlib). Those routines are rather old (written around 1990) and seem not very well optimized for current computers (for example, there are lots of IF branches, and vectorization may be difficult...). Because it is not trivial to "vectorize" the code, I instead tried parallelization with OpenMP. The modified splev() looks like this, where the input points are simply divided among threads:

      subroutine splev(t,n,c,k,x,y,m,e,ier)
c  subroutine splev evaluates in a number of points x(i),i=1,2,...,m
c  a spline s(x) of degree k, given in its b-spline representation.
(... same here ...)

c  main loop for the different points.
c$omp parallel do default(shared)
c$omp.firstprivate(arg,ier,l1,l,ll,sp,h) private(i,j)
      do i = 1, m

c  fetch a new x-value arg.
        arg = x(i)
c  check if arg is in the support
        if (arg .lt. tb .or. arg .gt. te) then
            if (e .eq. 0) then
                goto 35
            else if (e .eq. 1) then
                y(i) = 0
                goto 80
            else if (e .eq. 2) then
                ier = 1
                ! goto 100        !! I skipped this error handling for simplicity.
            else if (e .eq. 3) then
                if (arg .lt. tb) then
                    arg = tb
                else
                    arg = te
                endif
            endif
        endif

c  search for knot interval t(l) <= arg < t(l+1)
 35     if ( t(l) <= arg .or. l1 == k2 ) go to 40
        l1 = l
        l = l - 1
        go to 35
  40    if ( arg < t(l1) .or. l == nk1 ) go to 50
        l = l1
        l1 = l + 1
        go to 40

c  evaluate the non-zero b-splines at arg.
  50    call fpbspl(t, n, k, arg, l, h)

c  find the value of s(x) at x=arg.
        sp = 0.0d0
        ll = l - k1

        do 60 j = 1, k1
          ll = ll + 1
          sp = sp + c(ll)*h(j)
  60    continue
        y(i) = sp

 80     continue

      enddo
c$omp end parallel do
 100  return
      end
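
Rebuilding might look roughly like the following (a sketch only: the exact paths and the library name depend on how Dierckx.jl was installed, so treat them as assumptions):

$ cd ~/.julia/v0.4/Dierckx/deps/src                            # assumed source location
$ gfortran -O3 -fPIC -fopenmp -shared *.f -o libddierckx.so    # assumed library name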

After rebuilding the package with gfortran -fopenmp, calling perf() above gives

$ OMP_NUM_THREADS=1 julia interp.jl
  1.911112 seconds (174.20 k allocations: 277.414 MB, 3.49% gc time)
  1.599154 seconds (1.56 k allocations: 269.295 MB, 0.41% gc time)
  1.671308 seconds (1.56 k allocations: 269.295 MB, 0.28% gc time)

$ OMP_NUM_THREADS=2 julia interp.jl
  1.308713 seconds (174.20 k allocations: 277.414 MB, 5.14% gc time)
  0.874616 seconds (1.56 k allocations: 269.295 MB, 0.46% gc time)
  0.897505 seconds (1.56 k allocations: 269.295 MB, 0.21% gc time)

$ OMP_NUM_THREADS=4 julia interp.jl
  0.749203 seconds (174.20 k allocations: 277.414 MB, 9.31% gc time)
  0.446702 seconds (1.56 k allocations: 269.295 MB, 0.93% gc time)
  0.439522 seconds (1.56 k allocations: 269.295 MB, 0.43% gc time)

$ OMP_NUM_THREADS=8 julia interp.jl
  0.478504 seconds (174.20 k allocations: 277.414 MB, 14.66% gc time)
  0.243258 seconds (1.56 k allocations: 269.295 MB, 1.81% gc time)
  0.233157 seconds (1.56 k allocations: 269.295 MB, 0.89% gc time)

$ OMP_NUM_THREADS=16 julia interp.jl
  0.379243 seconds (174.20 k allocations: 277.414 MB, 19.02% gc time)
  0.129145 seconds (1.56 k allocations: 269.295 MB, 3.49% gc time)
  0.124497 seconds (1.56 k allocations: 269.295 MB, 1.80% gc time)

# Julia v0.4.0; machine: Linux x86_64, Xeon 2.60 GHz, 16 cores

so the scaling seems quite good (but please let me know if I am making a big mistake in using OpenMP this way...). If the above result is correct, it means that 8 threads are necessary with this Fortran code to match the speed of interp1() on the OP's machine. The good news is that there is probably room for improvement in the Fortran code (even for serial runs). Anyway, the OP's program (in its final form) is essentially comparing the performance of the underlying interpolation routines: interp1() in MATLAB vs splev() in Fortran.

answered by roygvib