Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Finding overlap in ranges with R

I have two data.frames each with three columns: chrom, start & stop, let's call them rangesA and rangesB. For each row of rangesA, I'm looking to find which (if any) row in rangesB fully contains the rangesA row - by which I mean rangesAChrom == rangesBChrom, rangesAStart >= rangesBStart and rangesAStop <= rangesBStop.

Right now I'm doing the following, which I just don't like very much. Note that I'm looping over the rows of rangesA for other reasons, but none of those reasons are likely to be a big deal, it just ends up making things more readable given this particular solution.

rangesA:

chrom   start   stop
 5       100     105
 1       200     250
 9       275     300

rangesB:

chrom    start    stop
  1       200      265
  5       99       106
  9       275      290

for each row in rangesA:

matches <- which((rangesB[,'chrom']  == rangesA[row,'chrom']) &&
                 (rangesB[,'start'] <= rangesA[row, 'start']) &&
                 (rangesB[,'stop'] >= rangesA[row, 'stop']))

I figure there's got to be a better (and by better, I mean faster over large instances of rangesA and rangesB) way to do this than looping over this construct. Any ideas?

like image 770
geoffjentry Avatar asked Oct 12 '10 15:10

geoffjentry


People also ask

How do you calculate range overlap?

Overlap = min(A2, B2) - max(A1, B1) + 1. In other words, the overlap of two integer intervals is a difference between the minimum value of the two upper boundaries and the maximum value of the two lower boundaries, plus 1.

How do you know if two intervals are overlapping?

1) Sort all intervals in increasing order of start time. This step takes O(nLogn) time. 2) In the sorted array, if start time of an interval is less than end of previous interval, then there is an overlap.

What is an overlap function?

Overlap function is a special type of aggregation function which measures the degree of overlapping between different classes. Recently, complex fuzzy sets have been successfully applied in many applications. In this paper, we extend the concept of overlap functions to the complex-valued setting.

What does it mean when ranges overlap?

If both ranges have at least one common point, then we say that they're overlapping. In other words, we say that two ranges and are overlapping if: On the other hand, non-overlapping ranges don't have any points in common.


2 Answers

Use the IRanges/GenomicRanges packages from Bioconductor, which is made for dealing with these exact problems (and scales massively)

source("http://bioconductor.org/biocLite.R")
biocLite("IRanges")

There are a few appropriate containers for ranges on different chromosomes, one is RangesList

library(IRanges)
rangesA <- split(IRanges(rangesA$start, rangesA$stop), rangesA$chrom)
rangesB <- split(IRanges(rangesB$start, rangesB$stop), rangesB$chrom)
#which rangesB wholly contain at least one rangesA?
ov <- countOverlaps(rangesB, rangesA, type="within")>0
like image 183
Aaron Statham Avatar answered Oct 12 '22 23:10

Aaron Statham


This would be a lot easier / faster if you can merge the two objects first.

ranges <- merge(rangesA,rangesB,by="chrom",suffixes=c("A","B"))
ranges[with(ranges, startB <= startA & stopB >= stopA),]
#  chrom startA stopA startB stopB
#1     1    200   250    200   265
#2     5    100   105     99   106
like image 39
Joshua Ulrich Avatar answered Oct 13 '22 00:10

Joshua Ulrich