 

subset slow in large matrix

Tags: r, matrix, subset

I have a numeric vector of length 5,000,000

>head(coordvec)
[1] 47286545 47286546 47286547 47286548 47286549 47286550

and a 1,400,000 x 3 numeric matrix

>head(subscores)
        V1       V2     V3
1 47286730 47286725  0.830
2 47286740 47286791  0.065
3 47286750 47286806 -0.165
4 47288371 47288427  0.760
5 47288841 47288890  0.285
6 47288896 47288945  0.225

For each number in coordvec, I want to find the mean of V3 over the rows of subscores whose [V1, V2] interval contains that number. To do that, I am taking the following approach:

results <- numeric(length(coordvec))
for (i in seq_along(coordvec)) {
    # rows whose [V1, V2] interval contains the current coordinate
    select_rows <- subscores[, 1] < coordvec[i] & subscores[, 2] > coordvec[i]
    scores_subset <- subscores[select_rows, 3]
    results[i] <- mean(scores_subset)
}

This is very slow, and would take a few days to finish. Is there a faster way?

Thanks,

Dan

Asked Feb 17 '23 by dlv


2 Answers

I think there are two challenging parts to this question. The first is finding the overlaps. I'd use the IRanges package from Bioconductor (?findInterval in the base package might also be useful).
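For reference, here is a minimal base-R sketch of that findInterval() idea; it is my addition, not part of the answer, and it assumes the [V1, V2] intervals are non-overlapping, so each coordinate falls in at most one interval (with overlapping intervals, IRanges is the better tool):

o <- order(subscores[, 1])
st <- subscores[o, 1]; en <- subscores[o, 2]; sc <- subscores[o, 3]
i <- findInterval(coordvec, st)     # index of the last interval with V1 <= coord
i[i == 0L] <- NA                    # coordinate lies before every interval
ok <- !is.na(i) & st[i] < coordvec & coordvec < en[i]  # strict containment
res <- ifelse(ok, sc[i], NA_real_)  # the single matching V3, or NA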

library(IRanges)

Create width-1 ranges representing the coordinate vector, and a set of ranges representing the scores. I sort the coordinate vector for convenience, assuming that duplicate coordinates can be treated the same:

coord <- sort(sample(.Machine$integer.max, 5000000))  # simulated coordinates
starts <- sample(.Machine$integer.max, 1200000)       # simulated interval starts
scores <- runif(length(starts))                       # simulated V3 scores

q <- IRanges(coord, width=1)         # queries: one width-1 range per coordinate
s <- IRanges(starts, starts + 100L)  # subjects: the score intervals

Here we find which query overlaps which subject:

system.time({
    olaps <- findOverlaps(q, s)
})

This takes about 7s on my laptop. There are different types of overlaps (see ?findOverlaps), so this step may require a bit of refinement (see the sketch after the output below). The result is a pair of vectors indexing the query and the overlapping subject.

> olaps
Hits of length 281909
queryLength: 5000000
subjectLength: 1200000
       queryHits subjectHits 
        <integer>   <integer> 
 1             19      685913 
 2             35      929424 
 3             46     1130191 
 4             52       37417 
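On the refinement mentioned above: if only queries falling entirely inside a subject should count, the type argument narrows the overlap criterion. This one-liner is a sketch of mine, not from the answer; note that type="within" uses non-strict bounds, so exact-boundary hits may still need filtering to match the strict V1 < x & V2 > x condition:

olaps_within <- findOverlaps(q, s, type="within")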

I think this is the end of the first complicated part, finding the 281909 overlaps. (I don't think the data.table answer offered elsewhere addresses this, though I could be mistaken...)

The next challenging part is calculating a large number of means. The built-in way would be something like

olaps0 <- head(olaps, 10000)
system.time({
    res0 <- tapply(scores[subjectHits(olaps0)], queryHits(olaps0), mean)
})

which takes about 3.25s on my computer and appears to scale linearly, so maybe 90s for the 280k overlaps. But I think we can accomplish this tabulation efficiently with data.table. The original coordinates are start(q)[queryHits(olaps)], so:

require(data.table)
dt <- data.table(coord=start(q)[queryHits(olaps)],
                 score=scores[subjectHits(olaps)])
res1 <- dt[, mean(score), by=coord]$V1

which takes about 2.5s for all 280k overlaps.

Some more speed can be had by recognizing that the query hits are ordered. We want to calculate a mean for each run of query hits. We start by creating a variable to indicate the end of each run of query hits:

idx <- c(queryHits(olaps)[-1] != queryHits(olaps)[-length(olaps)], TRUE)  # TRUE at each run end

and then calculate the cumulative scores at the ends of each run, the length of each run, and the difference between the cumulative score at the end and at the start of the run:

scoreHits <- cumsum(scores[subjectHits(olaps)])[idx]  # cumulative score at run ends
n <- diff(c(0L, seq_along(idx)[idx]))                 # length of each run
xt <- diff(c(0L, scoreHits))                          # total score of each run

And finally, the mean is

res2 <- xt / n

This takes about 0.6s for all the data, and is identical to (though more cryptic than?) the data.table result

> identical(res1, res2)
[1] TRUE
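A toy illustration of the run-ends trick, using hypothetical numbers of my own rather than values from the data above:

qh <- c(1L, 1L, 2L, 2L, 2L, 5L)            # ordered query hits
sc <- c(0.2, 0.4, 0.1, 0.5, 0.3, 0.9)      # corresponding scores
end <- c(qh[-1] != qh[-length(qh)], TRUE)  # TRUE at the last hit of each run
csum <- cumsum(sc)[end]                    # cumulative score at each run end
len <- diff(c(0L, seq_along(end)[end]))    # run lengths: 2, 3, 1
diff(c(0, csum)) / len                     # run means: 0.3, 0.3, 0.9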

The original coordinates corresponding to the means are

start(q)[ queryHits(olaps)[idx] ]
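To pair each mean with its coordinate in a single object, a small sketch reusing the pieces above:

final <- data.frame(coord = start(q)[queryHits(olaps)[idx]], mean = res2)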
Answered Mar 02 '23 by Martin Morgan


Something like this might be faster:

require(data.table)
subscores <- as.data.table(subscores)

subscores[, cond := V1 < coordvec & V2 > coordvec]
subscores[list(cond)[[1]], mean(V3)] 

The list(cond)[[1]] is needed because: "When i is a single variable name, it is not considered an expression of column names and is instead evaluated in calling scope." (source: ?data.table)
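For completeness, a sketch of the same idea with a non-equi join; this is my addition rather than part of this answer, it assumes data.table >= 1.9.8, and it returns one mean per coordinate instead of a single overall mean:

require(data.table)
dt <- as.data.table(subscores)
cv <- data.table(coord = coordvec)
# join each coordinate to the rows whose interval strictly contains it,
# then average V3 within each coordinate's matches
res <- dt[cv, on = .(V1 < coord, V2 > coord), mean(V3), by = .EACHI]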

Answered Mar 02 '23 by Michael