
How to get faster pairwise distance and vector calculations in Julia 0.5 compared to Matlab2016b?

I am running a program that requires repeated pairwise distance calculations across a collection of points in 2D, followed by vector calculations. This ends up being the bottleneck in my run times, so I have tried to rewrite my code from Matlab to Julia to take advantage of its faster speeds. The problem I have run into, however, is that the function I have written in Julia is actually running five times slower than my Matlab implementation. Given Julia's reputation as a substantially faster language, I am assuming that I have done something wrong.

I wrote a simple example to illustrate what I am seeing.

Julia code:

using Distances
function foo()
  historyarray = zeros(5000,150,2)
  a = randn(150,2)
  for i in 1:5000
    pd = pairwise(Euclidean(),a.')
    xgv = broadcast(-,a[:,1].',a[:,1])
    ygv = broadcast(-,a[:,2].',a[:,2])
    th = atan2(ygv,xgv)
    fv = 1./(pd+1)
    xfv = fv*cos(th)
    yfv = fv*sin(th)
    a[:,1]+= sum(xfv,2)
    a[:,2]+= sum(yfv,2)

    historyarray[i,:,:] = copy(a)
  end
end

Matlab code:

function foo
histarray = zeros(5000,150,2);
a = randn(150,2);
for i=1:5000

    pd = pdist2(a,a);
    xgv = -bsxfun(@minus,a(:,1),a(:,1)');
    ygv = -bsxfun(@minus,a(:,2),a(:,2)');
    th = atan2(ygv,xgv);
    fv = 1./(pd+1);
    xfv = fv.*cos(th);
    yfv = fv.*sin(th);
    a(:,1) = a(:,1)+sum(xfv,2);
    a(:,2) = a(:,2)+sum(yfv,2);

    histarray(i,:,:)=a;
end

end

When I test the Julia code's speed (multiple times to account for compiling time) I get:

@time foo()
16.110077 seconds (2.65 M allocations: 8.566 GB, 6.30% gc time)

The Matlab function's performance on the other hand is:

tic
foo
toc
Elapsed time is 3.339807 seconds. 

When I run the profile viewer on the Julia code, the components that take the most time are lines 9, 11, and 12. Is there perhaps something strange happening with the trigonometric functions?

asked Dec 06 '22 by Trock

2 Answers

You are correct that the calls to sin, cos, and atan2 are the bottleneck in your Julia code. Nevertheless, the high number of allocations implies that there is still potential for optimizations.

On the upcoming version of Julia, your code can easily be rewritten to avoid unnecessary allocations by using the improved dot-broadcasting syntax a .= f.(b,c). This is equivalent to broadcast!(f, a, b, c) and updates a in place. Additionally, several dot-broadcast calls on the right-hand side are automatically fused into a single one. Finally, the @views macro turns every slicing operation like a[:,1], which would otherwise allocate a copy, into a view. The new code looks like:

function foo2()
    a = rand(150,2)
    historyarray = zeros(5000,150,2)
    pd = zeros(size(a,1), size(a,1))
    xgv = similar(pd)
    ygv = similar(pd)
    th = similar(pd)
    fv = similar(pd)
    xfv = similar(pd)
    yfv = similar(pd)
    tmp = zeros(size(a,1))
    @views for i in 1:5000
        pairwise!(pd, Euclidean(),a.')
        xgv .= a[:,1].' .- a[:,1]
        ygv .= a[:,2].' .- a[:,2]
        th .= atan2.(ygv,xgv)
        fv .= 1./(pd.+1)
        xfv .= fv.*cos.(th)
        yfv .= fv.*sin.(th)
        a[:,1:1] .+= sum!(tmp, xfv)
        a[:,2:2] .+= sum!(tmp, yfv)
        historyarray[i,:,:] = a
    end
end

(Note that I use element-wise multiplication in xfv .= fv.*cos.(th), matching your Matlab code, rather than the matrix multiplication fv*cos(th) in your Julia version.)
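As a minimal, hypothetical illustration of the fusion described above (the names x, y, z here are ours, not from the original code):

```julia
x, y = rand(1000), rand(1000)
z = similar(x)
# Un-fused: the right-hand side allocates a fresh result array on every call.
z = sin.(x) .+ cos.(y)
# Fused in-place broadcast: a single loop over the elements, writing directly
# into z, so no temporary arrays are allocated after compilation.
z .= sin.(x) .+ cos.(y)
```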

Benchmarking the new code shows a strong reduction in allocated memory:

julia> @benchmark foo2()
BenchmarkTools.Trial: 
  memory estimate:  67.88 MiB
  allocs estimate:  809507
  --------------
  minimum time:     7.655 s (0.06% GC)
  median time:      7.655 s (0.06% GC)
  mean time:        7.655 s (0.06% GC)
  maximum time:     7.655 s (0.06% GC)
  --------------
  samples:          1
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

(Most of this can be achieved on 0.5, but requires a bit more typing)

However, this still takes roughly twice as long as your Matlab version. Profiling reveals that indeed most of the time is spent in the trigonometric functions.
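You can see this yourself with Julia's built-in sampling profiler; a quick sketch:

```julia
foo2()             # run once so compilation time is excluded
Profile.clear()    # discard any previously collected samples
@profile foo2()    # collect stack samples while foo2 runs
Profile.print()    # inspect where the samples land, e.g. atan2/sin/cos
```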

Just for fun I tried:

const atan2 = +
const cos = x->2x
const sin = x->2x

and got:

julia> @benchmark foo2()
BenchmarkTools.Trial: 
  memory estimate:  67.88 MiB
  allocs estimate:  809507
  --------------
  minimum time:     1.020 s (0.69% GC)
  median time:      1.028 s (0.68% GC)
  mean time:        1.043 s (2.10% GC)
  maximum time:     1.100 s (7.75% GC)
  --------------
  samples:          5
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

I guess one reason for the slowness of the trigonometric functions might be that I use a prebuilt Julia binary rather than a self-compiled version, so the libm library that Julia uses is not optimized for my processor. But I doubt that this would make Julia much faster than Matlab in this case. The Matlab code just seems to be already close to optimal for this algorithm.

answered Feb 20 '23 by tim


The idea behind the get-rid-of-trig refactor is the identity sin(atan2(y,x)) == y/sqrt(x^2+y^2) (and likewise cos(atan2(y,x)) == x/sqrt(x^2+y^2)). Conveniently, the function hypot calculates the square-root denominator. inv is used to get away from slow divisions. The code:

# a constant input matrix to allow foo2/foo3 comparison
a = randn(150,2)

# calculation using trig functions
function foo2(b,n)
  a = copy(b)
  historyarray = zeros(n,size(a,1),2)
  pd = zeros(size(a,1), size(a,1))
  xgv = similar(pd)
  ygv = similar(pd)
  th = similar(pd)
  fv = similar(pd)
  xfv = similar(pd)
  yfv = similar(pd)
  tmp = zeros(size(a,1))
  @views for i in 1:n
      pairwise!(pd, Euclidean(),a.')
      xgv .= a[:,1].' .- a[:,1]
      ygv .= a[:,2].' .- a[:,2]
      th .= atan2.(ygv,xgv)
      fv .= 1./(pd.+1)
      xfv .= fv.*cos.(th)
      yfv .= fv.*sin.(th)
      a[:,1:1] .+= sum!(tmp, xfv)
      a[:,2:2] .+= sum!(tmp, yfv)
      historyarray[i,:,:] = a
  end
end

# helper functions to handle the annoying NaNs (0/0) from the self-interaction calc
nantoone(x) = ifelse(isnan(x),1.0,x)
nantozero(x) = ifelse(isnan(x),0.0,x)

# calculations using Pythagoras
function foo3(b,n)
  a = copy(b)
  historyarray = zeros(n,size(a,1),2)
  pd = zeros(size(a,1), size(a,1))
  xgv = similar(pd)
  ygv = similar(pd)
  th = similar(pd)
  fv = similar(pd)
  xfv = similar(pd)
  yfv = similar(pd)
  tmp = zeros(size(a,1))
  @views for i in 1:n
      pairwise!(pd, Euclidean(),a.')
      xgv .= a[:,1].' .- a[:,1]
      ygv .= a[:,2].' .- a[:,2]
      th .= inv.(hypot.(ygv,xgv))
      fv .= inv.(pd.+1)
      xfv .= nantoone.(fv.*xgv.*th)
      yfv .= nantozero.(fv.*ygv.*th)
      a[:,1:1] .+= sum!(tmp, xfv)
      a[:,2:2] .+= sum!(tmp, yfv)
      historyarray[i,:,:] = a
  end
end

And a benchmark:

julia> @time foo2(a,5000)
  9.698825 seconds (809.51 k allocations: 67.880 MiB, 0.33% gc time)

julia> @time foo3(a,5000)
  2.207108 seconds (809.51 k allocations: 67.880 MiB, 1.15% gc time)

A >4x improvement.
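The identity behind the refactor can be spot-checked numerically; a quick sketch:

```julia
for _ in 1:5
    x, y = randn(), randn()
    # both trig-free forms should agree with the trig versions
    @assert sin(atan2(y, x)) ≈ y / hypot(x, y)
    @assert cos(atan2(y, x)) ≈ x / hypot(x, y)
end
```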

Another takeaway from this is the convenience of a NaN-to-something function, which could be added to Base (similar to coalesce and nvl from the SQL world).
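For example, on the diagonal of the pairwise matrices, xgv is 0 and th is Inf, so their product is NaN, which the helper maps back to a harmless value:

```julia
nantoone(x) = ifelse(isnan(x), 1.0, x)
nantoone(0.0 * Inf)  # the NaN from the self-interaction term becomes 1.0
nantoone(2.5)        # non-NaN values pass through unchanged
```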

answered Feb 19 '23 by Dan Getz