I have a Run length encoded vector representing some value at every position on the genome, in order. As a toy example suppose I had just one chromosome of length 10, then I would have a vector looking like
library(GenomicRanges)
set.seed(1)
toyData = Rle(sample(1:3,10,replace=TRUE))
I would like to coerce this into a GRanges object. The best I can come up with is
gr = GRanges('toyChr',IRanges(cumsum(c(0,runLength(toyData)[-nrun(toyData)])),
width=runLength(toyData)),
toyData = runValue(toyData))
which works, but is quite slow. Is there a faster way to construct the same object?
An Rle(run-length-encoded) vector is a specific representation of a vector. The IRangespackage implements support for this class. Watch out: there is also a base R class called rlewhich has much less functionality.
An Rle(run-length-encoded) vector is a specific representation of a vector. The IRangespackage implements support for this class. Watch out: there is also a base R class called rlewhich has much less functionality. The run-length-encoded representation of a vector, represents the vector as a set of distinct runs with their own value.
In many ways Rles function as normal vectors, you can do arithmetic with them, transform them etc. using standard R functions like +and log2. There are also RleListwhich is a list of Rles. This class is used to represent a genome wide coverage track where each element of the list is a different chromosome. Useful functions for Rle
As @TheUnfunCat pointed out, the OP's solution is pretty solid. The solution below is only moderately faster than the original solution. I tried almost every combination of base R
and couldn't beat the efficiency Rle
from the S4Vectors
package, thus I resorted to Rcpp
. Here is the main function :
GenomeRcpp <- function(v) {
x <- WhichDiffZero(v)
m <- v[c(1L,x+1L)]
s <- c(0L,x)
e <- c(x,length(v))-1L
GRanges('toyChr',IRanges(start = s, end = e), toyData = m)
}
The WhichDiffZero
is the Rcpp
function that pretty much does the exact same thing as which(diff(v) != 0)
in base R
. Much credit goes to @G.Grothendieck.
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
IntegerVector WhichDiffZero(IntegerVector x) {
int nx = x.size()-1;
std::vector<int> y;
y.reserve(nx);
for(int i = 0; i < nx; i++) {
if (x[i] != x[i+1]) y.push_back(i+1);
}
return wrap(y);
}
Below are some benchmarks:
set.seed(437)
testData <- do.call(c,lapply(1:10^5, function(x) rep(sample(1:50, 1), sample(1:30, 1))))
microbenchmark(GenomeRcpp(testData), GenomeOrig(testData))
Unit: milliseconds
expr min lq mean median uq max neval cld
GenomeRcpp(testData) 20.30118 22.45121 26.59644 24.62041 27.28459 198.9773 100 a
GenomeOrig(testData) 25.11047 27.12811 31.73180 28.96914 32.16538 225.1727 100 a
identical(GenomeRcpp(testData), GenomeOrig(testData))
[1] TRUE
I've been working on this off and on for the past few days and I'm definitely not satisfied. I'm hoping that someone will take what I've done (as it is a different approach) and create something much better.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With