Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Calculate Euclidean distance matrix using a big.matrix object

I have an object of class big.matrix in R with dimension 778844 x 2. The values are all integers (kilometres). My objective is to calculate the Euclidean distance matrix using the big.matrix and have as a result an object of class big.matrix. I would like to know if there is an optimal way of doing that.

The reason for my choice of using the class big.matrix is memory limitation. I could transform my big.matrix to an object of class matrix and calculate the Euclidean distance matrix using dist(). However, dist() would return an object of size that would not be allocated in the memory.

Edit

The following answer was given by John W. Emerson, author and maintainer of the bigmemory package:

You could use big algebra I expect, but this would also be a very nice use case for Rcpp via sourceCpp(), and very short and easy. But in short, we don't even attempt to provide high-level features (other than the basics which we implemented as proof-of-concept). No single algorithm could cover all use cases once you start talking out-of-memory big.

like image 333
Samuel-Rosa Avatar asked Nov 16 '14 15:11

Samuel-Rosa


People also ask

How do you calculate Euclidean distance manually?

Determine the Euclidean distance between two points (a, b) and (-a, -b). d = 2√(a2+b2). Hence, the distance between two points (a, b) and (-a, -b) is 2√(a2+b2).

How do you find the distance between two matrices?

The Euclidean distance is simply the square root of the squared differences between corresponding elements of the rows (or columns). This is probably the most commonly used distance metric.


1 Answers

Here is a way using RcppArmadillo. Much of this is very similar to the RcppGallery example. This will return a big.matrix with the associated pairwise (by row) euclidean distances. I like to wrap my big.matrix functions in a wrapper function to create a cleaner syntax (i.e. avoid the @address and other initializations.

Note - as we are using bigmemory (and therefore concerned with RAM usage) I have this example returned the N-1 x N-1 matrix of only lower triangular elements. You could modify this but this is what I threw together.

euc_dist.cpp

// To enable the functionality provided by Armadillo's various macros,
// simply include them before you include the RcppArmadillo headers.
#define ARMA_NO_DEBUG

#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo, BH, bigmemory)]]

using namespace Rcpp;
using namespace arma;

// The following header file provides the definitions for the BigMatrix
// object
#include <bigmemory/BigMatrix.h>

// C++11 plugin
// [[Rcpp::plugins(cpp11)]]

template <typename T>
void BigArmaEuclidean(const Mat<T>& inBigMat, Mat<T> outBigMat) {

  int W = inBigMat.n_rows;

  for(int i = 0; i < W - 1; i++){
      for(int j=i+1; j < W; j++){
          outBigMat(j-1,i) = sqrt(sum(pow((inBigMat.row(i) - inBigMat.row(j)),2)));
      }
  }
}

// [[Rcpp::export]]
void BigArmaEuc(SEXP pInBigMat, SEXP pOutBigMat) {
  // First we tell Rcpp that the object we've been given is an external
  // pointer.
  XPtr<BigMatrix> xpMat(pInBigMat);
  XPtr<BigMatrix> xpOutMat(pOutBigMat);


  int type = xpMat->matrix_type();
  switch(type) {
      case 1:
        BigArmaEuclidean(
            arma::Mat<char>((char *)xpMat->matrix(), xpMat->nrow(), xpMat->ncol(), false),
            arma::Mat<char>((char *)xpOutMat->matrix(), xpOutMat->nrow(), xpOutMat->ncol(), false)
        );
        return;

      case 2:
        BigArmaEuclidean(
          arma::Mat<short>((short *)xpMat->matrix(), xpMat->nrow(), xpMat->ncol(), false),
          arma::Mat<short>((short *)xpOutMat->matrix(), xpOutMat->nrow(), xpOutMat->ncol(), false)
        );
        return;

      case 4:
        BigArmaEuclidean(
          arma::Mat<int>((int *)xpMat->matrix(), xpMat->nrow(), xpMat->ncol(), false),
          arma::Mat<int>((int *)xpOutMat->matrix(), xpOutMat->nrow(), xpOutMat->ncol(), false)
        );
        return;

      case 8:
        BigArmaEuclidean(
          arma::Mat<double>((double *)xpMat->matrix(), xpMat->nrow(), xpMat->ncol(), false),
          arma::Mat<double>((double *)xpOutMat->matrix(), xpOutMat->nrow(), xpOutMat->ncol(), false)
        );
        return;

      default:
        // We should never get here, but it resolves compiler warnings.
        throw Rcpp::exception("Undefined type for provided big.matrix");
  }

}

My little wrapper

bigMatrixEuc <- function(bigMat){
    zeros <- big.matrix(nrow = nrow(bigMat)-1,
                        ncol = nrow(bigMat)-1,
                        init = 0,
                        type = typeof(bigMat))
    BigArmaEuc(bigMat@address, zeros@address)
    return(zeros)
}

The test

library(Rcpp)
sourceCpp("euc_dist.cpp")

library(bigmemory)

set.seed(123)
mat <- matrix(rnorm(16), 4)
bm <- as.big.matrix(mat)

# Call new euclidean function
bm_out <- bigMatrixEuc(bm)[]

# pull out the matrix elements for out purposes
distMat <- as.matrix(dist(mat))
distMat[upper.tri(distMat, diag=TRUE)] <- 0
distMat <- distMat[2:4, 1:3]

# check if identical 
all.equal(bm_out, distMat, check.attributes = FALSE)
[1] TRUE
like image 125
cdeterman Avatar answered Oct 01 '22 03:10

cdeterman