How to merge overlapping integer vector elements of a list in R

Tags: r

I have a list of vectors:

l1 <- list(2:3, 4:5, 6:7, 8:9, 16:19, 15:19, 18:20, 20:21, 21:22, 
        23:24, 23:25, 26:27, 30:31, 31:32, 33:34, 35:36, 38:39, 42:43, 
        44:45, 46:47, 50:51, 54:55, 55:56, 57:58, 59:60, 64:65, 66:67, 
        68:69, 69:70, 73:74, 77:78, 80:81, 82:83, 84:85, 88:89, 90:91, 
        92:93, 94:95, 96:97, 100:101, 103:104, 105:106, 107:108)

Where vectors overlap with each other (inclusively), I need them merged (reducing the length of the list) so that the widest possible range is covered.

For example, taking the first 7 elements of l1,

l1 <- list(2:3, 4:5, 6:7, 8:9, 16:19, 15:19, 18:20,...

I want this transformed into:

l2 <- list(2:3, 4:5, 6:7, 8:9, 15:20,...

How can I do this efficiently?

asked Feb 21 '14 by user3313178


3 Answers

Here's a solution: first pull the endpoints of each vector slightly inward, so that adjacent but non-overlapping vectors (like 2:3 and 4:5) end up more than 1 apart, then unlist, sort, and group entries that are at most 1 apart:

# create a sorted vector of values with endpoints pulled slightly inward
vec <- sort(unlist(lapply(l1, function(x) c(x[1] + 0.1,
                                            head(tail(x, -1), -1),
                                            tail(x, 1) - 0.1))))

# split the vector where consecutive values differ by more than 1,
# then round back to integers and drop duplicates
lapply(split(vec, c(0, cumsum(diff(vec) > 1))), function(x) unique(round(x)))

The result:

$`0`
[1] 2 3

$`1`
[1] 4 5

$`2`
[1] 6 7

$`3`
[1] 8 9

$`4`
[1] 15 16 17 18 19 20 21 22

$`5`
[1] 23 24 25

$`6`
[1] 26 27

$`7`
[1] 30 31 32

$`8`
[1] 33 34

$`9`
[1] 35 36

$`10`
[1] 38 39

$`11`
[1] 42 43

$`12`
[1] 44 45

$`13`
[1] 46 47

$`14`
[1] 50 51

$`15`
[1] 54 55 56

$`16`
[1] 57 58

$`17`
[1] 59 60

$`18`
[1] 64 65

$`19`
[1] 66 67

$`20`
[1] 68 69 70

$`21`
[1] 73 74

$`22`
[1] 77 78

$`23`
[1] 80 81

$`24`
[1] 82 83

$`25`
[1] 84 85

$`26`
[1] 88 89

$`27`
[1] 90 91

$`28`
[1] 92 93

$`29`
[1] 94 95

$`30`
[1] 96 97

$`31`
[1] 100 101

$`32`
[1] 103 104

$`33`
[1] 105 106

$`34`
[1] 107 108
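
The result is a named list (the names come from split()). If the original unnamed-list format is preferred, the names can be stripped with unname(); this small follow-up is my addition, not part of the original answer:

l2 <- unname(lapply(split(vec, c(0, cumsum(diff(vec) > 1))),
                    function(x) unique(round(x))))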
answered by Sven Hohenstein


A naive approach could be this:

l2 <- l1
# walk the list, merging each element into its successor when they overlap
for(i in seq_along(l2)[-length(l2)]) {
   if(length(intersect(l2[[i]], l2[[i+1]])) > 0) {
      l2[[i+1]] <- sort.int(unique(c(l2[[i]], l2[[i+1]])))
      l2[[i]] <- as.list(NULL)  # empty the absorbed element
   }
}
# drop the emptied elements
dput(Filter(function(x) length(x) > 0, l2))
list(2:3, 4:5, 6:7, 8:9, 15:22, 23:25, 26:27, 30:32, 33:34, 35:36, 
    38:39, 42:43, 44:45, 46:47, 50:51, 54:56, 57:58, 59:60, 64:65, 
    66:67, 68:70, 73:74, 77:78, 80:81, 82:83, 84:85, 88:89, 90:91, 
    92:93, 94:95, 96:97, 100:101, 103:104, 105:106, 107:108)
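
Note that the loop only compares neighbouring elements, so it relies on the list being ordered by each vector's start value (as l1 is here). If that ordering can't be assumed, sorting first restores the precondition; this one-liner is my addition, not part of the original answer:

l1 <- l1[order(sapply(l1, min))]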
answered by alexis_laz


The data can be represented more compactly as ranges rather than as explicit vectors

rng = matrix(c(sapply(l1, min), sapply(l1, max)), ncol=2,
             dimnames=list(NULL, c("start", "end")))
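
For the example data, the first rows of rng are (output reproduced for orientation):

> head(rng)
     start end
[1,]     2   3
[2,]     4   5
[3,]     6   7
[4,]     8   9
[5,]    16  19
[6,]    15  19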

This seems to be a better representation, even for the results, so we use it throughout, in contrast to the phrasing of the original question. A pure R solution for relatively dense ranges whose maximum value is not too large (say, up to a few million) is to tabulate the occurrences of starts and ends across the overall range

ends = tabulate(rng[,"end"])
starts = tabulate(rng[,"start"], length(ends))

find the 'coverage', i.e. the positions where the cumulative number of starts exceeds the cumulative number of ends

coverage = cumsum(starts - ends) != 0

and calculate the beginnings and ends of these ranges

change = diff(coverage)
beg = 1 + which(change == 1)
end = 1 + which(change == -1)

leading to

f0 = function(rng) {
    ends <- tabulate(rng[, "end"])
    starts <- tabulate(rng[, "start"], length(ends))
    # net number of ranges still open after each position
    coverage <- cumsum(starts - ends)
    # +1 marks the first position of a covered stretch, -1 the position
    # where coverage returns to zero, i.e. the last covered position
    change <- diff(c(0, coverage) != 0)
    beg <- which(change == 1)
    end <- which(change == -1)
    matrix(c(beg, end), ncol=2, dimnames=list(NULL, c("start", "end")))
}

and

> head(f0(rng))
     start end
[1,]     2   3
[2,]     4   5
[3,]     6   7
[4,]     8   9
[5,]    15  22
[6,]    23  25

If instead the ranges are sparse or non-integer valued, a strategy might be to tag the ordered start and end coordinates with +1 or -1 and take a similar approach to calculating coverage

f1 <- function(rng) {
    # order all coordinates; starts carry a +1 tag, ends a -1 tag
    o <- order(rng)
    bounds <- c(rep(1, nrow(rng)), rep(-1, nrow(rng)))[o]
    # running count of open ranges at each sorted coordinate
    coverage <- cumsum(bounds)
    change <- diff(c(0, coverage != 0))
    orng <- rng[o]
    beg <- orng[change == 1]
    end <- orng[change == -1]
    matrix(c(beg, end), ncol=2, dimnames=list(NULL, c("start", "end")))
}
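
As a quick sanity check (my example, not from the original answer), f1 also handles fractional endpoints; the overlapping ranges [1, 2.5] and [2, 3.5] collapse to one:

rng2 <- matrix(c(1, 2, 2.5, 3.5), ncol=2,
               dimnames=list(NULL, c("start", "end")))

> f1(rng2)
     start end
[1,]     1 3.5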

Rather than these ad hoc solutions, the Bioconductor IRanges package provides a tested alternative, performing a 'reduction' (exactly the operation of interest, reducing overlapping ranges to their largest enclosing range) on the ranges.

library(IRanges)
f2 <- function(rng) {
    r <- reduce(IRanges(rng[,1], rng[,2]), min.gapwidth=0)
    matrix(c(start(r), end(r)), ncol=2,
           dimnames=list(NULL, c("start", "end")))
}

I guess none of these solutions is exactly right, as apparently the ranges 18:20, 20:21, ... are not supposed to overlap...

As a check on validity, we have

> identical(f0(rng), f1(rng))
[1] TRUE
> identical(f0(rng), f2(rng))
[1] TRUE

Results from the other solutions aren't exactly comparable, but implementing them as

f3 <- function(l2) {
    for(i in seq_along(l2)[-length(l2)]) {
        if(length(intersect(l2[[i]], l2[[i+1]])) > 0) { 
            l2[[i+1]] <- sort.int(unique(c(l2[[i]], l2[[i+1]])))
            l2[[i]] <- as.list(NULL)
        }   
    }
    Filter(function(x) length(x) > 0, l2)
}

f4 <- function(l1) {
    vec <- sort(unlist(lapply(l1, function(x) {
        c(x[1] + 0.1, head(tail(x, -1), -1), tail(x, 1) - 0.1)
    })))
    lapply(split(vec, c(0, cumsum(diff(vec) > 1))),
           function(x) unique(round(x)))
}

shows timings

> library(microbenchmark)
> microbenchmark(f0(rng), f1(rng), f2(rng), f3(l1), f4(l1))
Unit: microseconds
    expr      min         lq    median         uq       max neval
 f0(rng)  168.740   184.8365   196.598   206.9565   235.353   100
 f1(rng)  478.184   518.8550   565.973   594.1910   681.029   100
 f2(rng)  906.578   969.1530  1026.590  1119.5225  1201.842   100
  f3(l1) 4341.560  4600.6330  4644.767  4696.1170  5225.190   100
  f4(l1) 9652.549 10220.5320 10275.517 10364.2365 11439.372   100
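
As a rough cross-check that the matrix- and list-based results agree, the range matrix can be expanded back into a list of integer vectors; ranges_to_list is a hypothetical helper of mine, not part of any of the answers:

# expand each start/end row back into the full integer sequence
ranges_to_list <- function(m)
    lapply(seq_len(nrow(m)), function(i) m[i, "start"]:m[i, "end"])

# should agree element-for-element with the list-based result
identical(ranges_to_list(f0(rng)), f3(l1))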

The solutions f0 to f2 target different domains; in particular, the IRanges solution is likely robust, flexible (it offers far more than just 'reduce'), and performant for large data sets.

answered by Martin Morgan