Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Computing pairwise distances between a set of intervals

Tags:

r

intervals

Let's say I have a set of closed linear intervals represented by this matrix:

interval.mat = matrix(c(1,2,3,5,4,6,8,9), byrow = TRUE, ncol = 2)

where interval.mat[,1] are the interval start points and interval.mat[,2] are their corresponding end points.

I'm looking for an efficient (since this example matrix is a toy and in reality my matrix contains a few thousands of intervals) way to produce a matrix that will hold all the pairwise positive distances between the intervals. The distance between a pair of intervals should be the start of the interval with the bigger end among the two minus the end of the interval with the smaller end among the two. For example the distance between intervals c(1,2) and c(3,5) should 3 - 2 = 1, since the second interval ends after the first one. In case the intervals overlap the distance should be 0. So for example, in the case of c(3,5) and c(4,6) the distance would be 0.

So, the pairwise distance matrix for the intervals above would be:

> matrix(c(0,1,2,6,1,0,0,3,2,0,0,2,6,3,2,0), byrow = TRUE, nrow = 4, ncol = 4)
     [,1] [,2] [,3] [,4]
[1,]    0    1    2    6
[2,]    1    0    0    3
[3,]    2    0    0    2
[4,]    6    3    2    0
like image 438
user1701545 Avatar asked May 21 '14 17:05

user1701545


1 Answers

Here's an Rcpp solution. It will be fast and memory efficient (for details see below).

First let's define a helper function which calculates all the pairwise distances. If n is the number of intervals to consider, we have n*(n-1)/2 unique pairs of vectors (we don't take the same intervals into account, of course, as the distance between them is 0).

library('Rcpp')
library('inline')    
cppFunction("
   NumericVector distint_help(NumericMatrix x) {
      int n = x.nrow(); // number of rows
      NumericVector out(n*(n-1)/2); // result numeric vector
      int k = 0;
      for (int i=0; i<n-1; ++i) {
         for (int j=i+1; j<n; ++j) {
            if (x(i,0) >= x(j,1))
               out[k++] = x(i,0)-x(j,1);
            else if (x(j,0) > x(i,1))
               out[k++] = x(j,0)-x(i,1);
            else
               out[k++] = 0.0;
         }
      }
      return out;
   }
")

The above function returns a numeric vector with the calculated distances. Let's try to mimic the output of the built-in dist function (checkout the result of x <- dist(interval.mat); unclass(x)).

Now the main function:

distint <- function(interval) {
   stopifnot(is.numeric(interval), is.matrix(interval), ncol(interval) == 2)

   res <- distint_help(interval) # use Rcpp to calculate the distances

   # return the result similar to the one of dist()
   structure(res, class='dist', Size=nrow(interval), Diag=FALSE, Upper=FALSE)
}
distint(interval.mat)
##   1 2 3
## 2 1    
## 3 2 0  
## 4 6 3 2

The above object may be converted to an "ordinary" square matrix:

as.matrix(distint(interval.mat))
##   1 2 3 4
## 1 0 1 2 6
## 2 1 0 0 3
## 3 2 0 0 2
## 4 6 3 2 0

Unless the distance matrix is sparse (there are many many zeros), the above solution is storage efficient.

A benchmark:

test <- matrix(runif(1000), ncol=2)
library('microbenchmark')
library(proxy)
f <- function(x,y) max(min(x)-max(y),0)
microbenchmark(distint(test), as.matrix(dist(test, method=f)), times=10)

## Unit: milliseconds
##                               expr        min         lq     median         uq        max neval
##                      distint(test)   1.584548   1.615146   1.650645   3.071433   3.164231    10
##  as.matrix(dist(test, method = f)) 455.300974 546.438875 551.596582 599.977164 609.418194    10
like image 166
gagolews Avatar answered Sep 30 '22 09:09

gagolews