Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Euclidean distance matrix performance between two shapes

The problem I am having is that I have to calculate a Euclidean distance matrix between shapes that can range from 20,000 up to 60,000 points, which produces 10-20GB amounts of data. I have to run each of these calculates thousands of times so 20GB x 7,000 (each calculation is a different point cloud). The shapes can be either 2D or 3D.

EDITED (Updated questions)

  1. Is there a more efficient way to calculate the forward and backward distances without using two separate nested loops?

    I know I could save the data matrix and calculate the minimum distances in each direction, but then there is a huge memory issue with large point clouds.

  2. Is there a way to speed up this calculation and/or clean up the code to trim off time?

The irony is that I only need the matrix to calculate a very simple metric, but it requires the entire matrix to find that metric (Average Hausdorff distance).

Data example where each column represents a dimension of the shape and each row is a point in the shape:

first_configuration <- matrix(1:6,2,3)
second_configuration <- matrix(6:11,2,3)
colnames(first_configuration) <- c("x","y","z")
colnames(second_configuration) <- c("x","y","z")

This code calculates a Euclidean distance between between coordinates:

m <- nrow(first_configuration)
n <- nrow(second_configuration)

D <- sqrt(pmax(matrix(rep(apply(first_configuration * first_configuration, 1, sum), n), m, n, byrow = F) + matrix(rep(apply(second_configuration * second_configuration, 1, sum), m), m, n, byrow = T) - 2 * first_configuration %*% t(second_configuration), 0))
D

Output:

     [,1]      [,2]
[1,] 8.660254 10.392305
[2,] 6.928203  8.660254

EDIT: included hausdorff average code

d1 <- mean(apply(D, 1, min))
d2 <- mean(apply(D, 2, min))
average_hausdorff <- mean(d1, d2)

EDIT (Rcpp solution): Here is my attempt to implement it in Rcpp so the matrix is never saved to memory. Working now but very slow.

sourceCpp(code=
#include <Rcpp.h>
#include <limits>
using namespace Rcpp;

// [[Rcpp::export]]
double edist_rcpp(NumericVector x, NumericVector y){
    double d = sqrt( sum( pow(x - y, 2) ) );
    return d;
}


// [[Rcpp::export]]
double avg_hausdorff_rcpp(NumericMatrix x, NumericMatrix y){
    int nrowx = x.nrow();
    int nrowy = y.nrow();
    double new_low_x = std::numeric_limits<int>::max();
    double new_low_y = std::numeric_limits<int>::max();

    double mean_forward = 0;
    double mean_backward = 0;
    double mean_hd; 
    double td; 

    //forward
    for(int i = 0; i < nrowx; i++) {
        for(int j = 0; j < nrowy; j++) {
            NumericVector v1 = x.row(i);
            NumericVector v2 = y.row(j);
            td = edist_rcpp(v1, v2);
            if(td < new_low_x) {
                new_low_x = td;
            }
        }
        mean_forward = mean_forward + new_low_x;
        new_low_x = std::numeric_limits<int>::max();
    }

    //backward
    for(int i = 0; i < nrowy; i++) {
        for(int j = 0; j < nrowx; j++) {
            NumericVector v1 = y.row(i);
            NumericVector v2 = x.row(j);
            td = edist_rcpp(v1, v2);
            if(td < new_low_y) {
                new_low_y = td;
            }
        }
        mean_backward = mean_backward + new_low_y;
        new_low_y = std::numeric_limits<int>::max();
    }

    //hausdorff mean
    mean_hd = (mean_forward / nrowx + mean_backward / nrowy) / 2;

    return mean_hd;
}
)

EDIT (RcppParallel solution): Definitely faster than the serial Rcpp solution and most certainly the R solution. If anyone has tips on how to improve my RcppParallel code to trim off some extra time it would be much appreciated!

sourceCpp(code=
#include <Rcpp.h>
#include <RcppParallel.h>
#include <limits>

// [[Rcpp::depends(RcppParallel)]]
struct minimum_euclidean_distances : public RcppParallel::Worker {
    //Input
    const RcppParallel::RMatrix<double> a;
    const RcppParallel::RMatrix<double> b;

    //Output
    RcppParallel::RVector<double> medm;

    minimum_euclidean_distances(const Rcpp::NumericMatrix a, const Rcpp::NumericMatrix b, Rcpp::NumericVector medm) : a(a), b(b), medm(medm) {}

    void operator() (std::size_t begin, std::size_t end) {
        for(std::size_t i = begin; i < end; i++) {
            double new_low = std::numeric_limits<double>::max();
            for(std::size_t j = 0; j < b.nrow(); j++) {
                double dsum = 0;
                for(std::size_t z = 0; z < b.ncol(); z++) {
                    dsum = dsum + pow(a(i,z) - b(j,z), 2);
                }
                dsum = pow(dsum, 0.5);
                if(dsum < new_low) {
                    new_low = dsum;
                }
            }
            medm[i] = new_low;
        }
    }
};


// [[Rcpp::export]]
double mean_directional_hausdorff_rcpp(Rcpp::NumericMatrix a, Rcpp::NumericMatrix b){
    Rcpp::NumericVector medm(a.nrow());
    minimum_euclidean_distances minimum_euclidean_distances(a, b, medm);
    RcppParallel::parallelFor(0, a.nrow(), minimum_euclidean_distances);    
    double results = Rcpp::sum(medm);
    results = results / a.nrow();
    return results;
}


// [[Rcpp::export]]
double max_directional_hausdorff_rcpp(Rcpp::NumericMatrix a, Rcpp::NumericMatrix b){
    Rcpp::NumericVector medm(a.nrow());
    minimum_euclidean_distances minimum_euclidean_distances(a, b, medm);
    RcppParallel::parallelFor(0, a.nrow(), minimum_euclidean_distances);    
    double results = Rcpp::max(medm);
    return results;
}
)

Benchmarks using large point clouds of sizes 37,775 and 36,659:

//Rcpp serial solution
system.time(avg_hausdorff_rcpp(ll,rr))
   user  system elapsed 
409.143   0.000 409.105 

//RcppParallel solution
system.time(mean(mean_directional_hausdorff_rcpp(ll,rr), mean_directional_hausdorff_rcpp(rr,ll)))
   user  system elapsed 
260.712   0.000  33.265 
like image 670
JJL Avatar asked Nov 09 '17 22:11

JJL


People also ask

How do you find the Euclidean distance between two matrices?

The Euclidean distance is simply the square root of the squared differences between corresponding elements of the rows (or columns).

What is the Euclidean distance between these two objects?

In mathematics, the Euclidean distance between two points in Euclidean space is the length of a line segment between the two points. It can be calculated from the Cartesian coordinates of the points using the Pythagorean theorem, therefore occasionally being called the Pythagorean distance.

What is Euclidean distance between two data points?

In Mathematics, the Euclidean distance is defined as the distance between two points. In other words, the Euclidean distance between two points in the Euclidean space is defined as the length of the line segment between two points.

How do you find the distance between two matrices?

If we have two matrices A,B. Distance between A and B can be calculated using Singular values or 2 norms. You may use Distance =|(fnorm(A)−fnorm(B))| where fnorm = sq root of sum of squares of all singular values.


Video Answer


1 Answers

I try to use JuliaCall to do the calculation for the average Hausdorff distance. JuliaCall embeds Julia in R.

I only try a serial solution in JuliaCall. It seems to be faster than the RcppParallel and the Rcpp serial solution in the question, but I don't have the benchmark data. Since ability for parallel computation is built in Julia. A parallel computation version in Julia should be written without much difficulty. I will update my answer after finding that out.

Below is the julia file I wrote:

# Calculate the min distance from the k-th point in as to the points in bs
function min_dist(k, as, bs)
    n = size(bs, 1)
    p = size(bs, 2)
    dist = Inf
    for i in 1:n
        r = 0.0
        for j in 1:p
            r += (as[k, j] - bs[i, j]) ^ 2
            ## if r is already greater than the upper bound, 
            ## then there is no need to continue doing the calculation
            if r > dist
                continue
            end
        end
        if r < dist
            dist = r
        end
    end
    sqrt(dist)
end

function avg_min_dist_from(as, bs)
    distsum = 0.0
    n1 = size(as, 1)
    for k in 1:n1
        distsum += min_dist_from(k, as, bs)
    end
    distsum / n1
end

function hausdorff_avg_dist(as, bs)
    (avg_min_dist_from(as, bs) + avg_min_dist_from(bs, as)) / 2
end

And this is the R code to use the julia function:

first_configuration <- matrix(1:6,2,3)
second_configuration <- matrix(6:11,2,3)
colnames(first_configuration) <- c("x","y","z")
colnames(second_configuration) <- c("x","y","z")

m <- nrow(first_configuration)
n <- nrow(second_configuration)

D <- sqrt(matrix(rep(apply(first_configuration * first_configuration, 1, sum), n), m, n, byrow = F) + matrix(rep(apply(second_configuration * second_configuration, 1, sum), m), m, n, byrow = T) - 2 * first_configuration %*% t(second_configuration))
D

d1 <- mean(apply(D, 1, min))
d2 <- mean(apply(D, 2, min))
average_hausdorff <- mean(d1, d2)

library(JuliaCall)
## the first time of julia_setup could be quite time consuming
julia_setup()
## source the julia file which has our hausdorff_avg_dist function
julia_source("hausdorff.jl")

## check if the julia function is correct with the example
average_hausdorff_julia <- julia_call("hausdauff_avg_dist",
                                      first_configuration,
                                      second_configuration)
## generate some large random point clouds
n1 <- 37775
n2 <- 36659
as <- matrix(rnorm(n1 * 3), n1, 3)
bs <- matrix(rnorm(n2 * 3), n2, 3)

system.time(julia_call("hausdauff_avg_dist", as, bs))

The time on my laptop was less than 20 seconds, note this is performance of the serial version of JuliaCall! I used the same data to test RCpp serial solution in the question, which took more than 10 minutes to run. I don't have RCpp parallel on my laptop now so I can't try that. And as I said, Julia has built-in ability to do parallel computation.

like image 108
Consistency Avatar answered Sep 24 '22 00:09

Consistency