I have 5000 3D points in a Matrix A
and another 5000 3D point in a matrix B
.
For each point in A
i want to find the smallest distance to a point in B
. These distances should be stored in an array with 5000 entries.
So far I have this solution, running in about 0.145342 seconds (23 allocations: 191.079 MiB)
. How can I improve this further?
using Distances
A = rand(5000, 3)
B = rand(5000, 3)
mis = @time minimum(Distances.pairwise(SqEuclidean(), A, B, dims=1), dims=2)
This is a standard way to do it as it will have a better time complexity (especially for larger data):
using NearestNeighbors
nn(KDTree(B'; leafsize = 10), A')[2] .^ 2
Two comments:
B'
and A'
in the solution; if your original data were transposed it would not be needed; the reason why it is designed this way is that Julia uses column major matrix storage)Generating a big distance matrix using Distances.pairwise(SqEuclidean(), A, B, dims=1)
is not efficient because the main memory is pretty slow nowadays compared to CPU caches and the computing power of modern CPUs and this is not gonna be better any time soon (see "memory wall"). It is faster to compute the minimum on-the-fly using two basic nested for loops. Additionally, one can use multiple cores to compute this faster using multiple threads.
function computeMinDist(A, B)
n, m = size(A, 1), size(B, 1)
result = zeros(n)
Threads.@threads for i = 1:n
minSqDist = Inf
@inbounds for j = 1:m
dx = A[i,1] - B[j,1]
dy = A[i,2] - B[j,2]
dz = A[i,3] - B[j,3]
sqDist = dx*dx + dy*dy + dz*dz
if sqDist < minSqDist
minSqDist = sqDist
end
end
result[i] = minSqDist
end
return result
end
mis = @time computeMinDist(A, B)
Note the Julia interpreter uses 1 thread by default but this can be tuned using the environment variable JULIA_NUM_THREADS=auto
or just by running it using the flag --threads=auto
. See the multi-threading documentation for more information.
Here are performance results on my i5-9600KF machine with 6 cores (with two 5000x3 matrices):
Initial implementation: 93.4 ms
This implementation: 4.4 ms
This implementation is thus 21 times faster. Results are the same to few ULP.
Note the code can certainly be optimized further using loop tiling, and possibly by transposing A
and B
so the JIT can generate a more efficient implementation using SIMD instructions.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With