I have two data.frames each with three columns: chrom, start & stop, let's call them rangesA and rangesB. For each row of rangesA, I'm looking to find which (if any) row in rangesB fully contains the rangesA row - by which I mean rangesAChrom == rangesBChrom, rangesAStart >= rangesBStart and rangesAStop <= rangesBStop
.
Right now I'm doing the following, which I just don't like very much. Note that I'm looping over the rows of rangesA for other reasons, but none of those reasons are likely to be a big deal, it just ends up making things more readable given this particular solution.
rangesA:
chrom start stop
5 100 105
1 200 250
9 275 300
rangesB:
chrom start stop
1 200 265
5 99 106
9 275 290
for each row in rangesA:
matches <- which((rangesB[,'chrom'] == rangesA[row,'chrom']) &&
(rangesB[,'start'] <= rangesA[row, 'start']) &&
(rangesB[,'stop'] >= rangesA[row, 'stop']))
I figure there's got to be a better (and by better, I mean faster over large instances of rangesA and rangesB) way to do this than looping over this construct. Any ideas?
Overlap = min(A2, B2) - max(A1, B1) + 1. In other words, the overlap of two integer intervals is a difference between the minimum value of the two upper boundaries and the maximum value of the two lower boundaries, plus 1.
1) Sort all intervals in increasing order of start time. This step takes O(nLogn) time. 2) In the sorted array, if start time of an interval is less than end of previous interval, then there is an overlap.
Overlap function is a special type of aggregation function which measures the degree of overlapping between different classes. Recently, complex fuzzy sets have been successfully applied in many applications. In this paper, we extend the concept of overlap functions to the complex-valued setting.
If both ranges have at least one common point, then we say that they're overlapping. In other words, we say that two ranges and are overlapping if: On the other hand, non-overlapping ranges don't have any points in common.
Use the IRanges/GenomicRanges packages from Bioconductor, which is made for dealing with these exact problems (and scales massively)
source("http://bioconductor.org/biocLite.R")
biocLite("IRanges")
There are a few appropriate containers for ranges on different chromosomes, one is RangesList
library(IRanges)
rangesA <- split(IRanges(rangesA$start, rangesA$stop), rangesA$chrom)
rangesB <- split(IRanges(rangesB$start, rangesB$stop), rangesB$chrom)
#which rangesB wholly contain at least one rangesA?
ov <- countOverlaps(rangesB, rangesA, type="within")>0
This would be a lot easier / faster if you can merge the two objects first.
ranges <- merge(rangesA,rangesB,by="chrom",suffixes=c("A","B"))
ranges[with(ranges, startB <= startA & stopB >= stopA),]
# chrom startA stopA startB stopB
#1 1 200 250 200 265
#2 5 100 105 99 106
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With