Let's say I have a set of closed linear intervals represented by this matrix:
interval.mat = matrix(c(1,2,3,5,4,6,8,9), byrow = TRUE, ncol = 2)
where interval.mat[,1]
are the interval start points and interval.mat[,2]
are their corresponding end points.
I'm looking for an efficient (since this example matrix is a toy and in reality my matrix contains a few thousands of intervals) way to produce a matrix that will hold all the pairwise positive distances between the intervals. The distance between a pair of intervals should be the start of the interval with the bigger end among the two minus the end of the interval with the smaller end among the two. For example the distance between intervals c(1,2)
and c(3,5)
should 3 - 2 = 1
, since the second interval ends after the first one. In case the intervals overlap the distance should be 0. So for example, in the case of c(3,5)
and c(4,6)
the distance would be 0.
So, the pairwise distance matrix for the intervals above would be:
> matrix(c(0,1,2,6,1,0,0,3,2,0,0,2,6,3,2,0), byrow = TRUE, nrow = 4, ncol = 4)
[,1] [,2] [,3] [,4]
[1,] 0 1 2 6
[2,] 1 0 0 3
[3,] 2 0 0 2
[4,] 6 3 2 0
Here's an Rcpp solution. It will be fast and memory efficient (for details see below).
First let's define a helper function which calculates all the pairwise distances. If n
is the number of intervals to consider, we have n*(n-1)/2
unique pairs of vectors (we don't take the same intervals into account, of course, as the distance between them is 0).
library('Rcpp')
library('inline')
cppFunction("
NumericVector distint_help(NumericMatrix x) {
int n = x.nrow(); // number of rows
NumericVector out(n*(n-1)/2); // result numeric vector
int k = 0;
for (int i=0; i<n-1; ++i) {
for (int j=i+1; j<n; ++j) {
if (x(i,0) >= x(j,1))
out[k++] = x(i,0)-x(j,1);
else if (x(j,0) > x(i,1))
out[k++] = x(j,0)-x(i,1);
else
out[k++] = 0.0;
}
}
return out;
}
")
The above function returns a numeric vector with the calculated distances. Let's try to mimic the output of the built-in dist
function (checkout the result of x <- dist(interval.mat); unclass(x)
).
Now the main function:
distint <- function(interval) {
stopifnot(is.numeric(interval), is.matrix(interval), ncol(interval) == 2)
res <- distint_help(interval) # use Rcpp to calculate the distances
# return the result similar to the one of dist()
structure(res, class='dist', Size=nrow(interval), Diag=FALSE, Upper=FALSE)
}
distint(interval.mat)
## 1 2 3
## 2 1
## 3 2 0
## 4 6 3 2
The above object may be converted to an "ordinary" square matrix:
as.matrix(distint(interval.mat))
## 1 2 3 4
## 1 0 1 2 6
## 2 1 0 0 3
## 3 2 0 0 2
## 4 6 3 2 0
Unless the distance matrix is sparse (there are many many zeros), the above solution is storage efficient.
A benchmark:
test <- matrix(runif(1000), ncol=2)
library('microbenchmark')
library(proxy)
f <- function(x,y) max(min(x)-max(y),0)
microbenchmark(distint(test), as.matrix(dist(test, method=f)), times=10)
## Unit: milliseconds
## expr min lq median uq max neval
## distint(test) 1.584548 1.615146 1.650645 3.071433 3.164231 10
## as.matrix(dist(test, method = f)) 455.300974 546.438875 551.596582 599.977164 609.418194 10
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With