Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficiently construct GRanges/IRanges from Rle vector

I have a Run length encoded vector representing some value at every position on the genome, in order. As a toy example suppose I had just one chromosome of length 10, then I would have a vector looking like

library(GenomicRanges)

set.seed(1)
toyData = Rle(sample(1:3,10,replace=TRUE))

I would like to coerce this into a GRanges object. The best I can come up with is

gr = GRanges('toyChr',IRanges(cumsum(c(0,runLength(toyData)[-nrun(toyData)])),
                              width=runLength(toyData)),
             toyData = runValue(toyData))

which works, but is quite slow. Is there a faster way to construct the same object?

like image 266
user1356855 Avatar asked Aug 30 '16 08:08

user1356855


People also ask

What is an RLE vector in R?

An Rle(run-length-encoded) vector is a specific representation of a vector. The IRangespackage implements support for this class. Watch out: there is also a base R class called rlewhich has much less functionality.

What is a run-length-encoded vector in R?

An Rle(run-length-encoded) vector is a specific representation of a vector. The IRangespackage implements support for this class. Watch out: there is also a base R class called rlewhich has much less functionality. The run-length-encoded representation of a vector, represents the vector as a set of distinct runs with their own value.

What are the functions of RLE in R?

In many ways Rles function as normal vectors, you can do arithmetic with them, transform them etc. using standard R functions like +and log2. There are also RleListwhich is a list of Rles. This class is used to represent a genome wide coverage track where each element of the list is a different chromosome. Useful functions for Rle


Video Answer


1 Answers

As @TheUnfunCat pointed out, the OP's solution is pretty solid. The solution below is only moderately faster than the original solution. I tried almost every combination of base R and couldn't beat the efficiency Rle from the S4Vectors package, thus I resorted to Rcpp. Here is the main function :

GenomeRcpp <- function(v) {
    x <- WhichDiffZero(v)
    m <- v[c(1L,x+1L)]
    s <- c(0L,x)
    e <- c(x,length(v))-1L
    GRanges('toyChr',IRanges(start = s, end = e), toyData = m)
}

The WhichDiffZero is the Rcpp function that pretty much does the exact same thing as which(diff(v) != 0) in base R. Much credit goes to @G.Grothendieck.

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
IntegerVector WhichDiffZero(IntegerVector x) {
    int nx = x.size()-1;
    std::vector<int> y;
    y.reserve(nx);
    for(int i = 0; i < nx; i++) {
        if (x[i] != x[i+1]) y.push_back(i+1);
    }
    return wrap(y);
}

Below are some benchmarks:

set.seed(437)
testData <- do.call(c,lapply(1:10^5, function(x) rep(sample(1:50, 1), sample(1:30, 1))))

microbenchmark(GenomeRcpp(testData), GenomeOrig(testData))
Unit: milliseconds
                expr      min       lq     mean   median       uq      max neval cld
GenomeRcpp(testData) 20.30118 22.45121 26.59644 24.62041 27.28459 198.9773   100   a
GenomeOrig(testData) 25.11047 27.12811 31.73180 28.96914 32.16538 225.1727   100   a

identical(GenomeRcpp(testData), GenomeOrig(testData))
[1] TRUE

I've been working on this off and on for the past few days and I'm definitely not satisfied. I'm hoping that someone will take what I've done (as it is a different approach) and create something much better.

like image 176
Joseph Wood Avatar answered Sep 26 '22 23:09

Joseph Wood