 

Why is this devectorized Julia code over 20x too slow?

Tags:

julia

I thought in Julia (unlike in R or Matlab) devectorized code was often faster than vectorized code. But I'm not finding this to be the case. Here's an example:

julia> x = Float64[1:10000000];

julia> y = Array(Float64, length(x));

julia> @time for i = 1:length(x) y[i] = exp(x[i]) end;
elapsed time: 7.014107314 seconds (959983704 bytes allocated, 25.39% gc time)

julia> @time y = exp(x);
elapsed time: 0.364695612 seconds (80000128 bytes allocated)

Why is the vectorized code so much faster? It looks like the devectorized code is allocating over 10x as much memory. But only a few bytes actually need to be allocated to exponentiate any number of floats. Is there a way to write the devectorized code so that it doesn't allocate so much memory, and thus runs faster than the vectorized code?

Thanks!

asked Dec 28 '14 by Jeff


2 Answers

Consider the following code snippet:

x = Float64[1:10000000];
y = Array(Float64, length(x));

# Devectorized loop, wrapped in a function
function nonglobal_devec!(x,y)
    for i = 1:length(x) y[i] = exp(x[i]) end
end

# Vectorized call, wrapped in a function
function nonglobal_vec(x)
    exp(x)
end

@time nonglobal_devec!(x,y);   # A: devectorized, in a function
@time y = nonglobal_vec(x);    # B: vectorized, in a function

x = Float64[1:10000000];
y = Array(Float64, length(x));
@time for i = 1:length(x) y[i] = exp(x[i]) end   # C: devectorized, global scope
@time y = exp(x)                                 # D: vectorized, global scope

which gives the times (labels match the comments above):

A: elapsed time: 0.072701108 seconds (115508 bytes allocated)
B: elapsed time: 0.074584697 seconds (80201532 bytes allocated)
C: elapsed time: 2.029597656 seconds (959990464 bytes allocated, 22.86% gc time)
D: elapsed time: 0.058509661 seconds (80000128 bytes allocated)

The odd one out, C, is slow because it runs in global scope, where type inference does not operate on non-constant globals, so slower, generic code is generated.
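A related way around this, if you really want to work at top level, is to declare the globals const so their types are fixed and inference can use them. A minimal sketch (my addition, not part of the original answer; Julia 0.3-era syntax to match the rest of this post):

# `const` fixes the type of each global, so the top-level loop can be
# compiled with concrete types instead of the slow generic path.
const cx = Float64[1:10000000];
const cy = Array(Float64, length(cx));
@time for i = 1:length(cx) cy[i] = exp(cx[i]) end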

The relative timings of A and B are subject to some variability because the functions are compiled the first time they are used. Running them again gives

A2: elapsed time: 0.038542212 seconds (80 bytes allocated)
B2: elapsed time: 0.063630172 seconds (80000128 bytes allocated)

which makes sense, as A2 doesn't allocate memory (the 80 bytes are for the function's return value) while B2 creates a new vector. Also note that B2 allocates the same amount of memory as D does; the extra memory reported the first time was allocated during compilation.
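If you want vectorized-looking code without B2's fresh allocation on every call, one option is to fill a preallocated array with map!. A sketch (my addition, assuming the same x and y as above):

# Writes exp of each element of x into the preallocated y,
# so repeated calls allocate essentially nothing.
map!(exp, y, x)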

Finally, whether devectorized or vectorized code is faster is decided on a case-by-case basis. For example, if you implemented matrix multiplication naively with for loops and no cache awareness, you would likely be much slower than the vectorized A*b, which uses BLAS.
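To illustrate that last point (my example, not the answerer's), here is a deliberately naive triple loop next to the BLAS-backed product:

# Naive matrix multiply: the i, j, p loop order strides across rows of
# the column-major arrays, so it is cache-unfriendly on top of lacking
# BLAS's blocking and SIMD.
function naive_matmul(A, B)
    m, k = size(A)
    n = size(B, 2)
    C = zeros(eltype(A), m, n)
    for i = 1:m, j = 1:n, p = 1:k
        C[i,j] += A[i,p] * B[p,j]
    end
    return C
end

A = rand(500, 500); B = rand(500, 500);
@time naive_matmul(A, B);   # devectorized, naive
@time A * B;                # vectorized, dispatches to BLAS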

answered Nov 03 '22 by IainDunning


Non-constant global variables are slow:

http://julia.readthedocs.org/en/latest/manual/performance-tips/
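The tip behind that link, in miniature (my sketch; the names are illustrative): the same loop is slow when it reads and writes a non-const global, and fast once the array becomes a function argument.

z = rand(10^6)   # non-const global: its type could change at any time,
                 # so code touching it from global scope stays generic
@time for i = 1:length(z) z[i] = 2 * z[i] end

# Inside a function, z is a local with a known concrete type.
function double!(z)
    for i = 1:length(z) z[i] = 2 * z[i] end
end
@time double!(z)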

answered Nov 03 '22 by StefanKarpinski