R optimization: How can I avoid a for loop in this situation?

I'm trying to do a simple genomic track intersection in R, and I'm running into major performance problems, probably related to my use of for loops.

In this situation, I have pre-defined windows at intervals of 100bp and I'm trying to calculate how much of each window is covered by the annotations in mylist. Graphically, it looks something like this:

         0     100   200   300   400   500   600
windows: |-----|-----|-----|-----|-----|-----|

mylist:  |-|    |-----------|

So I wrote some code to do just that, but it's fairly slow and has become a bottleneck in my code:

## one counter per 100-bp window
windows <- numeric(6)

## second track: each element is c(start, end)
mylist <- list(c(1, 20), c(120, 320))

## do the intersection
for (i in seq_along(mylist)) {
  st <- floor(mylist[[i]][1] / 100) + 1   # first window touched
  sp <- floor(mylist[[i]][2] / 100) + 1   # last window touched
  for (j in st:sp) {
    b <- max((j - 1) * 100, mylist[[i]][1])
    e <- min(j * 100, mylist[[i]][2])
    windows[j] <- windows[j] + e - b + 1
  }
}

print(windows)
[1]  20  81 101  21   0   0

Naturally, this is being used on data sets that are much larger than the example I provide here. Through some profiling, I can see that the bottleneck is in the for loops, but my clumsy attempt to vectorize it using *apply functions resulted in code that runs an order of magnitude more slowly.

I suppose I could write something in C, but I'd like to avoid that if possible. Can anyone suggest another approach that will speed this calculation up?

asked Mar 25 '10 17:03

chrisamiller

2 Answers

The "right" thing to do is to use the Bioconductor IRanges package, which uses an IntervalTree data structure to represent these ranges.

Once each of your tracks is stored in its own IRanges object, you can use the findOverlaps function to intersect them.

Get it here:

http://www.bioconductor.org/packages/release/bioc/html/IRanges.html

By the by, the internals of the package are written in C, so it's super fast.

EDIT

On second thought, this isn't quite the slam-dunk one-liner I was suggesting, but you should definitely start using this library if you're working at all with genomic intervals (or other kinds of intervals); you'll likely need to do set operations on them sooner or later. Sorry, I don't have time to provide the exact answer, though.

I just thought it's important to point this library out to you.
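For illustration only, here is a minimal sketch of how the findOverlaps route might look for the example in the question. It assumes the Bioconductor IRanges package is installed and that coordinates are 1-based and closed, so the windows are 1-100, 101-200, and so on (an assumption, not something stated in the question). The variable names annot, hits, pieces and covered are made up for this example.

```r
# Assumes the Bioconductor IRanges package is installed.
library(IRanges)

# Six 100-bp windows: 1-100, 101-200, ..., 501-600 (assumed convention)
windows <- IRanges(start = seq(1, 501, by = 100), width = 100)

# The two annotations from the question
annot <- IRanges(start = c(1, 120), end = c(20, 320))

# Every (window, annotation) pair that overlaps
hits <- findOverlaps(windows, annot)

# Clip each annotation to the window it overlaps
pieces <- pintersect(windows[queryHits(hits)], annot[subjectHits(hits)])

# Sum the clipped widths per window; windows with no hit stay at 0
covered <- numeric(length(windows))
per_window <- tapply(width(pieces), queryHits(hits), sum)
covered[as.integer(names(per_window))] <- per_window

print(covered)
```

With this convention no position is shared between adjacent windows, so the third and fourth windows come out as 100 and 20 rather than the 101 and 21 produced by the loop in the question. If annotations can overlap one another, the overlapping bases are counted once per annotation, as in the original loop.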

answered Oct 28 '22 17:10

Steve Lianoglou

I'm not entirely sure why the third and fourth windows aren't 100 and 20, since that would make more sense to me. Here's a one-liner that gives that behavior:

Reduce('+', lapply(mylist, function(x) hist(x[1]:x[2], breaks = (0:6) * 100, plot = FALSE)$counts))

Note that you need to specify the upper bound in breaks, but if you don't know it in advance, it shouldn't be hard to make another pass over the data to get it.
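As a rough base-R sketch of the same idea without lapply or hist, one option is a difference array followed by a cumulative sum, with the number of windows derived from the largest end coordinate when it isn't given. The function name window_coverage and its arguments are hypothetical, and the code assumes integer coordinates of at least 1 with closed intervals.

```r
# Hypothetical helper: per-window coverage via a difference array.
# Assumes integer coordinates >= 1 and closed intervals [start, end].
window_coverage <- function(starts, ends, width = 100,
                            n_windows = ceiling(max(ends) / width)) {
  len <- n_windows * width
  # +1 where an annotation begins, -1 just past where it ends
  delta <- tabulate(starts, nbins = len + 1) -
    tabulate(ends + 1, nbins = len + 1)
  # Running sum gives the number of annotations covering each position
  depth <- cumsum(delta)[seq_len(len)]
  # One column per window, then add up the depth inside each window
  colSums(matrix(depth, nrow = width))
}

mylist <- list(c(1, 20), c(120, 320))
starts <- vapply(mylist, function(x) x[1], numeric(1))
ends   <- vapply(mylist, function(x) x[2], numeric(1))

print(window_coverage(starts, ends, n_windows = 6))
# [1]  20  81 100  20   0   0
```

This trades memory for speed: it allocates one number per base pair, so for whole chromosomes it may be worth processing one chromosome at a time. Overlapping annotations are counted once each, as in the loop from the question.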

answered Oct 28 '22 17:10

Jonathan Chang

Tags: optimization, r, intersection, bioinformatics
