
R: condense indexes

Tags:

r

I have a vector like the following:

xx <- c(1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1)

I want to find the indexes that have ones and combine them together. In this case, I want the output to look like 1 6 and 11 14 in a 2x2 matrix. My vector is actually very long so I can't do this by hand. Can anyone help me with this? Thanks.

user1938809 asked Dec 09 '22 14:12


2 Answers

Since the question originally had a 'bioinformatics' tag, I'll mention the Bioconductor package IRanges (and its companion for ranges on genomes, GenomicRanges):

> library(IRanges)
> xx <- c(1,1,1,1,1,1,0,0,0,0,1,1,1,1)
> sl = slice(Rle(xx), 1)
> sl
Views on a 14-length Rle subject

views:
    start end width
[1]     1   6     6 [1 1 1 1 1 1]
[2]    11  14     4 [1 1 1 1]

The views could be coerced to a matrix, although a matrix is often not the most convenient form for whatever the next step is:

> matrix(c(start(sl), end(sl)), ncol=2)
     [,1] [,2]
[1,]    1    6
[2,]   11   14

Other operations might start on the Rle, e.g.,

> xx = c(2,2,2,3,3,3,0,0,0,0,4,4,1,1)
> r = Rle(xx)
> m = cbind(start(r), end(r))[runValue(r) != 0,,drop=FALSE]
> m
     [,1] [,2]
[1,]    1    3
[2,]    4    6
[3,]   11   12
[4,]   13   14
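
For readers without Bioconductor installed, the same start/end table can be sketched in base R with rle(). This is only an illustration, not part of IRanges: the helper name run_bounds is hypothetical, and it assumes a plain atomic vector in which 0 marks the gaps.

```r
# Base-R sketch: start and end of every non-zero run, using rle() only.
# `run_bounds` is a hypothetical helper name, not an IRanges function.
run_bounds <- function(x) {
    r <- rle(x)
    end <- cumsum(r$lengths)        # last index of each run
    start <- end - r$lengths + 1    # first index of each run
    # keep only the runs whose value is non-zero
    cbind(start, end)[r$values != 0, , drop = FALSE]
}

xx <- c(2, 2, 2, 3, 3, 3, 0, 0, 0, 0, 4, 4, 1, 1)
m <- run_bounds(xx)
m
```

The rows should match the cbind(start(r), end(r)) result above; the only difference is that the columns carry the names start and end.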

See the help page ?Rle for the full flexibility of the Rle class. To go from a matrix like the one above to a new Rle, as asked in the comment below, one might create a new Rle of the appropriate length and then subset-assign using an IRanges as the index:

> r = Rle(0L, max(m))
> r[IRanges(m[,1], m[,2])] = 1L
> r
integer-Rle of length 14 with 3 runs
  Lengths: 6 4 4
  Values : 1 0 1

One could expand this to a full vector

> as(r, "integer")
 [1] 1 1 1 1 1 1 0 0 0 0 1 1 1 1
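
The same round trip, from a matrix of ranges back to a 0/1 vector, can be sketched without IRanges. The function name ranges_to_vector is hypothetical, and the sketch assumes a non-empty two-column matrix of inclusive start/end positions.

```r
# Base-R sketch: rebuild a 0/1 integer vector from a two-column matrix of
# inclusive start/end positions. `ranges_to_vector` is a hypothetical name.
# Assumes m has at least one row.
ranges_to_vector <- function(m, n = max(m)) {
    v <- integer(n)                           # all zeros to begin with
    idx <- unlist(Map(seq, m[, 1], m[, 2]))   # every index covered by a range
    v[idx] <- 1L
    v
}

m <- cbind(c(1, 4, 11, 13), c(3, 6, 12, 14))
v <- ranges_to_vector(m)
v
```

This expands every range into explicit indices, so for very long vectors the Rle representation is likely to be more economical.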

but often it's better to continue the analysis on the Rle. The class is very flexible, so one way of going from xx to an integer vector of 1's and 0's is

> as(Rle(xx) > 0, "integer")
 [1] 1 1 1 1 1 1 0 0 0 0 1 1 1 1

Again, though, it often makes sense to stay in Rle space. And Arun's answer to your separate question is probably best of all.

Performance (speed) is important, although in this case I think the flexibility of the Rle class would outweigh somewhat poorer performance, and a matrix is an unlikely end-point for a typical analysis. Nonetheless, the IRanges infrastructure is performant:

eddi <- function(xx)
    matrix(which(diff(c(0,xx,0)) != 0) - c(0,1),
           ncol = 2, byrow = TRUE)

iranges = function(xx) {
    sl = slice(Rle(xx), 1)
    matrix(c(start(sl), end(sl)), ncol=2)
}

iranges.1 = function(xx) {
    r = Rle(xx)
    cbind(start(r), end(r))[runValue(r) != 0, , drop=FALSE]
}

with

> library(microbenchmark)
> xx = sample(c(0, 1), 1e5, TRUE)
> microbenchmark(eddi(xx), iranges(xx), iranges.1(xx), times=10)
Unit: milliseconds
          expr       min        lq    median        uq      max neval
      eddi(xx)  45.88009  46.69360  47.67374 226.15084 234.8138    10
   iranges(xx) 112.09530 114.36889 229.90911 292.84153 294.7348    10
 iranges.1(xx)  31.64954  31.72658  33.26242  35.52092 226.7817    10
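
Since eddi() is benchmarked here without explanation, the following base-R sketch unpacks it one step at a time on the vector from the question. The intermediate names padded, edges and res are made up for illustration, and the approach assumes the input contains only 0's and 1's.

```r
xx <- c(1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1)

# Pad with 0 on both sides so that a run of 1's touching either end of the
# vector still produces a 0 -> 1 edge and a 1 -> 0 edge.
padded <- c(0, xx, 0)

# Positions where the value changes. Odd entries are the starts of runs of
# 1's; even entries are one past the end of those runs.
edges <- which(diff(padded) != 0)

# Subtract 0 from the starts and 1 from the ends (c(0, 1) is recycled),
# then fold the pairs into rows.
res <- matrix(edges - c(0, 1), ncol = 2, byrow = TRUE)
res
```

With non-zero values other than 1 (for example 2,2,3,3) an extra edge appears between adjacent runs and the pairing breaks; passing xx != 0 in place of xx is one possible workaround.
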
Martin Morgan answered Dec 11 '22 09:12


Something like this, maybe?

if (xx[1] == 1) {
    rr <- cumsum(c(0, rle(xx)$lengths))
} else {
    rr <- cumsum(rle(xx)$lengths)
}
if (length(rr) %% 2 == 1) {
    rr <- head(rr, -1)
}
oo <- matrix(rr, ncol=2, byrow=TRUE)
oo[, 1] <- oo[, 1] + 1
oo
#      [,1] [,2]
# [1,]    1    6
# [2,]   11   14

This edit takes care of cases where 1) the vector starts with a "0" rather than a "1", and 2) the vector ends in a run of 0's, which leaves an unpaired element in rr that has to be dropped. For example: xx <- c(1,1,1,1,1,1,0,0,0,0).
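
To check both edge cases, the steps above can be wrapped in a function. This is a sketch: the name one_runs is hypothetical, and it assumes xx is a non-empty vector of 0's and 1's.

```r
# The steps above wrapped in a function; `one_runs` is a hypothetical name.
# Assumes xx is a non-empty vector containing only 0's and 1's.
one_runs <- function(xx) {
    ends <- cumsum(rle(xx)$lengths)              # last index of every run
    rr <- if (xx[1] == 1) c(0, ends) else ends   # prepend 0 when the first run is 1's
    if (length(rr) %% 2 == 1) {
        rr <- head(rr, -1)                       # trailing run of 0's: drop its end
    }
    oo <- matrix(rr, ncol = 2, byrow = TRUE)
    oo[, 1] <- oo[, 1] + 1                       # end of the previous run + 1 = start
    oo
}

a <- one_runs(c(1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1))  # the question's vector
b <- one_runs(c(0, 0, 1, 1, 1, 0, 1))                       # starts with 0
d <- one_runs(c(1, 1, 1, 1, 1, 1, 0, 0, 0, 0))              # ends with 0
```

Each result has one row per run of 1's, with the start in the first column and the end in the second.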

Arun answered Dec 11 '22 07:12