I have a list of vectors:
l1 <- list(2:3, 4:5, 6:7, 8:9, 16:19, 15:19, 18:20, 20:21, 21:22,
23:24, 23:25, 26:27, 30:31, 31:32, 33:34, 35:36, 38:39, 42:43,
44:45, 46:47, 50:51, 54:55, 55:56, 57:58, 59:60, 64:65, 66:67,
68:69, 69:70, 73:74, 77:78, 80:81, 82:83, 84:85, 88:89, 90:91,
92:93, 94:95, 96:97, 100:101, 103:104, 105:106, 107:108)
Where there are vectors that overlap with each other (inclusive of endpoints), I need them to be merged (reducing the length of the list) so that the widest possible range is covered.
For example, taking the first 7 elements of l1,
l1 <- list(2:3, 4:5, 6:7, 8:9, 16:19, 15:19, 18:20, ...
I want this transformed into:
l2 <- list(2:3, 4:5, 6:7, 8:9, 15:20,...
How can I do this efficiently?
Here's a solution: first pull the ends of each vector inward by 0.1, so that adjacent but non-overlapping vectors end up more than 1 apart, then unlist, sort, and split wherever consecutive values differ by more than 1:
# create a sorted vector of adjusted end values
vec <- sort(unlist(lapply(l1, function(x) c(x[1] + 0.1,
                                            head(tail(x, -1), -1),
                                            tail(x, 1) - 0.1))))
# split vector if the difference between values is greater than 1
# then convert back to integer and remove the duplicates
lapply(split(vec, c(0, cumsum(diff(vec) > 1))), function(x) unique(round(x)))
The result:
$`0`
[1] 2 3
$`1`
[1] 4 5
$`2`
[1] 6 7
$`3`
[1] 8 9
$`4`
[1] 15 16 17 18 19 20 21 22
$`5`
[1] 23 24 25
$`6`
[1] 26 27
$`7`
[1] 30 31 32
$`8`
[1] 33 34
$`9`
[1] 35 36
$`10`
[1] 38 39
$`11`
[1] 42 43
$`12`
[1] 44 45
$`13`
[1] 46 47
$`14`
[1] 50 51
$`15`
[1] 54 55 56
$`16`
[1] 57 58
$`17`
[1] 59 60
$`18`
[1] 64 65
$`19`
[1] 66 67
$`20`
[1] 68 69 70
$`21`
[1] 73 74
$`22`
[1] 77 78
$`23`
[1] 80 81
$`24`
[1] 82 83
$`25`
[1] 84 85
$`26`
[1] 88 89
$`27`
[1] 90 91
$`28`
[1] 92 93
$`29`
[1] 94 95
$`30`
[1] 96 97
$`31`
[1] 100 101
$`32`
[1] 103 104
$`33`
[1] 105 106
$`34`
[1] 107 108
A naive approach could be this:
l2 <- l1
for (i in seq_along(l2)[-length(l2)]) {
    if (length(intersect(l2[[i]], l2[[i+1]])) > 0) {
        l2[[i+1]] <- sort.int(unique(c(l2[[i]], l2[[i+1]])))
        l2[[i]] <- as.list(NULL)
    }
}
dput(Filter(function(x) length(x) > 0, l2))
list(2:3, 4:5, 6:7, 8:9, 15:22, 23:25, 26:27, 30:32, 33:34, 35:36,
38:39, 42:43, 44:45, 46:47, 50:51, 54:56, 57:58, 59:60, 64:65,
66:67, 68:70, 73:74, 77:78, 80:81, 82:83, 84:85, 88:89, 90:91,
92:93, 94:95, 96:97, 100:101, 103:104, 105:106, 107:108)
The data seem to be more compactly represented as ranges than as explicit vectors:
rng = matrix(c(sapply(l1, min), sapply(l1, max)), ncol=2,
             dimnames=list(NULL, c("start", "end")))
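For the example l1 the first few rows of this representation look like
> head(rng)
     start end
[1,]     2   3
[2,]     4   5
[3,]     6   7
[4,]     8   9
[5,]    16  19
[6,]    15  19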
This seems to be a better representation, even for the results, so it is used throughout, in contrast to the phrasing of the original question. A pure R solution for relatively dense ranges whose maximum value is not too large (say, millions) is to tabulate the occurrence of ends and starts across the overall range
ends = tabulate(rng[,"end"])
starts = tabulate(rng[,"start"], length(ends))
find the 'coverage', where the cumulative number of starts is greater than the cumulative number of ends
coverage = cumsum(starts - ends) != 0
and calculate the beginnings and ends of these ranges
change = diff(coverage)
beg = 1 + which(change == 1)
end = 1 + which(change == -1)
leading to
f0 = function(rng) {
    ends <- tabulate(rng[, "end"])
    starts <- tabulate(rng[, "start"], length(ends))
    coverage <- cumsum(starts - ends)
    change <- diff(c(0, coverage) != 0)
    beg <- which(change == 1)
    end <- which(change == -1)
    matrix(c(beg, end), ncol=2, dimnames=list(NULL, c("start", "end")))
}
and
> head(f0(rng))
start end
[1,] 2 3
[2,] 4 5
[3,] 6 7
[4,] 8 9
[5,] 15 22
[6,] 23 25
Perhaps the ranges are sparse or non-integer valued, in which case a strategy might instead be to tag the ordered start and end coordinates with +1 or -1, and to take a similar approach of calculating coverage
f1 <- function(rng) {
    ## order all coordinates; starts (column 1) sort before ends of equal value
    o <- order(rng)
    ## +1 at each start, -1 at each end, taken in coordinate order
    bounds <- c(rep(1, nrow(rng)), rep(-1, nrow(rng)))[o]
    coverage <- cumsum(bounds)
    change <- diff(c(0, coverage != 0))
    orng <- rng[o]
    beg <- orng[change == 1]
    end <- orng[change == -1]
    matrix(c(beg, end), ncol=2, dimnames=list(NULL, c("start", "end")))
}
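As a quick sanity check with hypothetical non-integer input (where the tabulate-based f0 would not apply, since tabulate() expects positive integer bins), two overlapping fractional ranges collapse to one:
> rng2 <- matrix(c(1.5, 2.0, 2.5, 3.5), ncol=2, dimnames=list(NULL, c("start", "end")))
> f1(rng2)
     start end
[1,]   1.5 3.5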
Rather than these ad hoc solutions, the Bioconductor IRanges package provides a tested alternative, performing a 'reduction' (exactly the operation of interest, reducing overlapping ranges to their largest enclosing range) on the ranges.
library(IRanges)
f2 <- function(rng) {
    r <- reduce(IRanges(rng[,1], rng[,2]), min.gapwidth=0)
    matrix(c(start(r), end(r)), ncol=2,
           dimnames=list(NULL, c("start", "end")))
}
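If you stay within IRanges rather than converting back to a matrix, reduce() can be called directly; a minimal sketch, with min.gapwidth=0 chosen so that adjacent but non-overlapping ranges such as 2:3 and 4:5 stay separate:
## build an IRanges from the list of vectors and reduce it directly
ir <- IRanges(start=sapply(l1, min), end=sapply(l1, max))
reduce(ir, min.gapwidth=0)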
I guess none of these solutions is exactly right, though, as apparently the ranges 18:20, 20:21, ... (which share an endpoint) are not supposed to overlap in the expected output...
By way of validity, we have
> identical(f0(rng), f1(rng))
[1] TRUE
> identical(f0(rng), f2(rng))
[1] TRUE
Results from the other solutions aren't exactly comparable, but implementing them as
f3 <- function(l2) {
    for (i in seq_along(l2)[-length(l2)]) {
        if (length(intersect(l2[[i]], l2[[i+1]])) > 0) {
            l2[[i+1]] <- sort.int(unique(c(l2[[i]], l2[[i+1]])))
            l2[[i]] <- as.list(NULL)
        }
    }
    Filter(function(x) length(x) > 0, l2)
}
f4 <- function(l1) {
    vec <- sort(unlist(lapply(l1, function(x) {
        c(x[1] + 0.1, head(tail(x, -1), -1), tail(x, 1) - 0.1)
    })))
    lapply(split(vec, c(0, cumsum(diff(vec) > 1))),
           function(x) unique(round(x)))
}
gives these timings:
> library(microbenchmark)
> microbenchmark(f0(rng), f1(rng), f2(rng), f3(l1), f4(l1))
Unit: microseconds
expr min lq median uq max neval
f0(rng) 168.740 184.8365 196.598 206.9565 235.353 100
f1(rng) 478.184 518.8550 565.973 594.1910 681.029 100
f2(rng) 906.578 969.1530 1026.590 1119.5225 1201.842 100
f3(l1) 4341.560 4600.6330 4644.767 4696.1170 5225.190 100
f4(l1) 9652.549 10220.5320 10275.517 10364.2365 11439.372 100
The solutions f0 - f2 address different domains; in particular, the IRanges solution is likely robust, flexible (offering far more than just 'reduce'!), and performant for large data sets.