I am running a program that requires repeated use of pairwise distance calculations across a collection of points in 2D and then calculating the vectors. This ends up causing a bottleneck in my run times so I have tried to re-write my code from Matlab to Julia to take advantage of its faster speeds. The problem I have run into, however, is that the function I have written in Julia is actually running five times slower than my Matlab implementation. Given Julia's reputation as being a substantially faster language, I am assuming that I have done something wrong.
I wrote a simple example to illustrate what I am seeing.
Julia code:
using Distances
function foo()
historyarray = zeros(5000,150,2)
a = randn(150,2)
for i in 1:5000
pd = pairwise(Euclidean(),a.')
xgv = broadcast(-,a[:,1].',a[:,1])
ygv = broadcast(-,a[:,2].',a[:,2])
th = atan2(ygv,xgv)
fv = 1./(pd+1)
xfv = fv*cos(th)
yfv = fv*sin(th)
a[:,1]+= sum(xfv,2)
a[:,2]+= sum(yfv,2)
historyarray[i,:,:] = copy(a)
end
end
Matlab code:
function foo
histarray = zeros(5000,150,2);
a = randn(150,2);
for i=1:5000
pd = pdist2(a,a);
xgv = -bsxfun(@minus,a(:,1),a(:,1)');
ygv = -bsxfun(@minus,a(:,2),a(:,2)');
th = atan2(ygv,xgv);
fv = 1./(pd+1);
xfv = fv.*cos(th);
yfv = fv.*sin(th);
a(:,1) = a(:,1)+sum(xfv,2);
a(:,2) = a(:,2)+sum(yfv,2);
histarray(i,:,:)=a;
end
end
When I test the Julia code's speed (multiple times to account for compiling time) I get:
@time foo()
16.110077 seconds (2.65 M allocations: 8.566 GB, 6.30% gc time)
The Matlab function's performance on the other hand is:
tic
foo
toc
Elapsed time is 3.339807 seconds.
When I have run the profile viewer on the Julia code the component that takes the most amount of time are lines 9, 11, and 12. Is there perhaps something strange happening with trigonometric functions?
You are correct that the calls to sin
, cos
, and atan2
are the bottleneck in your Julia code. Nevertheless, the high number of allocations implies that there is still potential for optimizations.
On the upcoming version of Julia, your code can easily be rewritten to avoid unnecessary allocations by using the improved dot broadcasting syntax a .= f.(b,c)
. This is equivalent to broadcast!(f, a, b, c)
and updates a
in place. Additionally, several dot broadcast calls on the rhs are automatically fused into a single one. Finally, the @views
macro turns all slicing operations like a[:,1]
, which makes a copy, into views. The new code looks like:
function foo2()
a = rand(150,2)
historyarray = zeros(5000,150,2)
pd = zeros(size(a,1), size(a,1))
xgv = similar(pd)
ygv = similar(pd)
th = similar(pd)
fv = similar(pd)
xfv = similar(pd)
yfv = similar(pd)
tmp = zeros(size(a,1))
@views for i in 1:5000
pairwise!(pd, Euclidean(),a.')
xgv .= a[:,1].' .- a[:,1]
ygv .= a[:,2].' .- a[:,2]
th .= atan2.(ygv,xgv)
fv .= 1./(pd.+1)
xfv .= fv.*cos.(th)
yfv .= fv.*sin.(th)
a[:,1:1] .+= sum!(tmp, xfv)
a[:,2:2] .+= sum!(tmp, yfv)
historyarray[i,:,:] = a
end
end
(I use element-wise multiplication in xfv .= fv.*cos.(th)
like in your Matlab code instead of matrix multiplication.)
Benchmarking the new code shows a strong reduction in allocated memory:
julia> @benchmark foo2()
BenchmarkTools.Trial:
memory estimate: 67.88 MiB
allocs estimate: 809507
--------------
minimum time: 7.655 s (0.06% GC)
median time: 7.655 s (0.06% GC)
mean time: 7.655 s (0.06% GC)
maximum time: 7.655 s (0.06% GC)
--------------
samples: 1
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
(Most of this can be achieved on 0.5, but requires a bit more typing)
However, this still takes twice as long as your Matlab version. Profiling reveals that indeed most of the time is spent in the trigonometric functions.
Just for fun I tried:
const atan2 = +
const cos = x->2x
const sin = x->2x
and got:
julia> @benchmark foo2()
BenchmarkTools.Trial:
memory estimate: 67.88 MiB
allocs estimate: 809507
--------------
minimum time: 1.020 s (0.69% GC)
median time: 1.028 s (0.68% GC)
mean time: 1.043 s (2.10% GC)
maximum time: 1.100 s (7.75% GC)
--------------
samples: 5
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
I guess, one reason for the slowness of the trigonometric functions might be that I use the prebuilt binary and no self-compiled version of the libm
library, which is used by Julia. Therefore, the libm
code is not optimized for my processor. But I doubt that this will make Julia much faster than Matlab in this case. The Matlab code just seems to be already close to optimal for this algorithm.
The idea behind the get-rid-of-trig refactor is sin(atan(x,y))==y/sqrt(x^2+y^2)
. Conveniently the function hypot
calculates the square root denominator. inv
is used to get away from slow divisions. The code:
# a constant input matrix to allow foo2/foo3 comparison
a = randn(150,2)
# calculation using trig functions
function foo2(b,n)
a = copy(b)
historyarray = zeros(n,size(a,1),2)
pd = zeros(size(a,1), size(a,1))
xgv = similar(pd)
ygv = similar(pd)
th = similar(pd)
fv = similar(pd)
xfv = similar(pd)
yfv = similar(pd)
tmp = zeros(size(a,1))
@views for i in 1:n
pairwise!(pd, Euclidean(),a.')
xgv .= a[:,1].' .- a[:,1]
ygv .= a[:,2].' .- a[:,2]
th .= atan2.(ygv,xgv)
fv .= 1./(pd.+1)
xfv .= fv.*cos.(th)
yfv .= fv.*sin.(th)
a[:,1:1] .+= sum!(tmp, xfv)
a[:,2:2] .+= sum!(tmp, yfv)
historyarray[i,:,:] = a
end
end
# helper function to handle annoying Infs from self interaction calc
nantoone(x) = ifelse(isnan(x),1.0,x)
nantozero(x) = ifelse(isnan(x),0.0,x)
# calculations using Pythagoras
function foo3(b,n)
a = copy(b)
historyarray = zeros(5000,size(a,1),2)
pd = zeros(size(a,1), size(a,1))
xgv = similar(pd)
ygv = similar(pd)
th = similar(pd)
fv = similar(pd)
xfv = similar(pd)
yfv = similar(pd)
tmp = zeros(size(a,1))
@views for i in 1:n
pairwise!(pd, Euclidean(),a.')
xgv .= a[:,1].' .- a[:,1]
ygv .= a[:,2].' .- a[:,2]
th .= inv.(hypot.(ygv,xgv))
fv .= inv.(pd.+1)
xfv .= nantoone.(fv.*xgv.*th)
yfv .= nantozero.(fv.*ygv.*th)
a[:,1:1] .+= sum!(tmp, xfv)
a[:,2:2] .+= sum!(tmp, yfv)
historyarray[i,:,:] = a
end
end
And a benchamrk:
julia> @time foo2(a,5000)
9.698825 seconds (809.51 k allocations: 67.880 MiB, 0.33% gc time)
julia> @time foo3(a,5000)
2.207108 seconds (809.51 k allocations: 67.880 MiB, 1.15% gc time)
A >4x improvement.
Another takeaway from this is the convenience of a NaN-to-something function which could be added to Base (similar to coalesce
and nvl
from SQL world).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With