In R, what's an efficient way to extract the integers from ranges?
Let's say I have a matrix of ranges (column1=start, column2=end)
1 5
3 6
10 13
I would like to store the encompassing unique integers of all the ranges in the matrix into an object:
1
2
3
4
5
6
10
11
12
13
This would be applied to a matrix containing ~4 million ranges, so hopefully someone can offer a solution that is somewhat efficient.
Suppose you had start = 3, end = 7, and you'd marked each as a '1' on a number line starting at 1
starts: 0 0 1 0 0 0 0 0 0 ...
ends + 1: 0 0 0 0 0 0 0 1 0 ...
The cumulative sum of the starts minus the cumulative sum of the ends, and the difference between the two, is
cumsum(starts): 0 0 1 1 1 1 1 1 1 ...
cumsum(ends + 1): 0 0 0 0 0 0 0 1 1 ...
diff: 0 0 1 1 1 1 1 0 0
and the locations of the 1's in the diff are
which(diff > 0): 3 4 5 6 7
Use tabulate to allow for multiple starts / ends at the same location, and
range2 <- function(ranges)
{
max <- max(ranges)
starts <- tabulate(ranges[,1], max)
ends <- tabulate(ranges[,2] + 1L, max)
which(cumsum(starts) - cumsum(ends) > 0L)
}
For the question, this gives
> eg <- matrix(c(1, 3, 10, 5, 6, 13), 3)
> range2(eg)
[1] 1 2 3 4 5 6 10 11 12 13
It is pretty fast, for Andrie's example
> system.time(runs <- range2(xx))
user system elapsed
0.108 0.000 0.111
(this sounds a bit like DNA sequence analysis, for which GenomicRanges might be your friend; you'd use the coverage
and slice
functions on reads, perhaps input with readGappedAlignments
).
I don't know that it is particularly efficient, but if your matrix of ranges is ranges
then the following should work:
unique(unlist(apply(ranges, 1, function(x) x[1]:x[2])))
Use sequence
and rep
:
x <- matrix(c(1, 5, 3, 6, 10, 13), ncol=2, byrow=TRUE)
ranges <- function(x){
len <- x[, 2] - x[, 1] + 1
#allocate space
a <- b <- vector("numeric", sum(len))
a <- rep(x[, 1], len)
b <- sequence(len)-1
unique(a+b)
}
ranges(x)
[1] 1 2 3 4 5 6 10 11 12 13
Since this makes use of only vectorised code, this should be quite fast, even for large data sets. On my machine an input matrix of 1 million rows takes ~5 seconds to run:
set.seed(1)
xx <- sample(1e6, 1e6)
xx <- matrix(c(xx, xx+sample(1:100, 1e6, replace=TRUE)), ncol=2)
str(xx)
int [1:1000000, 1:2] 265509 372124 572853 908206 201682 898386 944670 660794 629110 61786 ...
system.time(zz <- ranges(xx))
user system elapsed
4.33 0.78 5.22
str(zz)
num [1:51470518] 265509 265510 265511 265512 265513 ...
Is it not something as simple as:
x <- matrix(c(1, 5, 3, 6, 10, 13), ncol=2, byrow=TRUE)
do.call(":",as.list(range(x)))
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13
Edit
Looks like I got the wrong end of the stick, but my answer can be modified to use union
, although this is just a wrapper for unique
:
Reduce("union",apply(x,1,function(y) do.call(":",as.list(y))))
[1] 1 2 3 4 5 6 10 11 12 13
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With