Finding overlap in ranges with R

Q: How do you calculate range overlap?

Overlap = min(A2, B2) - max(A1, B1) + 1. In other words, the overlap of two integer intervals [A1, A2] and [B1, B2] is the difference between the smaller of the two upper boundaries and the larger of the two lower boundaries, plus 1. A result of zero or less means the intervals do not overlap.
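As a rough illustration of the formula, here is a minimal R sketch. The helper name `overlap_length` is made up for this example, and it assumes closed integer intervals with start <= end; non-overlapping pairs are clamped to 0.

```r
# Hypothetical helper (not from any package): length of the overlap between
# closed integer intervals [a1, a2] and [b1, b2]. Vectorised via pmin/pmax;
# returns 0 when the intervals do not overlap.
overlap_length <- function(a1, a2, b1, b2) {
  pmax(0, pmin(a2, b2) - pmax(a1, b1) + 1)
}

overlap_length(100, 105, 99, 106)   # 6
overlap_length(275, 300, 275, 290)  # 16
overlap_length(1, 5, 10, 20)        # 0
```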

Q: How do you know if two intervals are overlapping?

1) Sort all intervals in increasing order of start. This step takes O(n log n) time.
2) Scan the sorted array: if the start of an interval is less than or equal to the end of the previous interval (for closed intervals), then there is an overlap.
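The sort-and-scan idea can be sketched in R as follows. The function name `any_overlap` is hypothetical, and the sketch assumes closed intervals with start <= end, so intervals that merely touch at an endpoint count as overlapping.

```r
# Hypothetical helper: TRUE if any two closed intervals share a point.
any_overlap <- function(start, end) {
  if (length(start) < 2) return(FALSE)
  ord <- order(start)          # step 1: sort by start, O(n log n)
  s <- start[ord]
  e <- end[ord]
  # step 2: compare each start with the end of the preceding interval
  any(s[-1] <= e[-length(e)])
}

any_overlap(c(10, 1, 6), c(12, 5, 9))  # FALSE
any_overlap(c(1, 5), c(5, 8))          # TRUE
```

Note that this only reports whether some overlap exists; it does not say which pairs overlap.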

Q: What is an overlap function?

An overlap function is a special type of aggregation function, used in fuzzy set theory, that measures the degree of overlap between different classes.

Q: What does it mean when ranges overlap?

If both ranges have at least one common point, then we say that they're overlapping. In other words, two ranges [A1, A2] and [B1, B2] are overlapping if A1 <= B2 and B1 <= A2. On the other hand, non-overlapping ranges don't have any points in common.

Tags:

r

bioinformatics

I have two data.frames each with three columns: chrom, start & stop, let's call them rangesA and rangesB. For each row of rangesA, I'm looking to find which (if any) row in rangesB fully contains the rangesA row - by which I mean rangesAChrom == rangesBChrom, rangesAStart >= rangesBStart and rangesAStop <= rangesBStop.

Right now I'm doing the following, which I just don't like very much. Note that I'm looping over the rows of rangesA for other reasons, but none of those reasons are likely to be a big deal, it just ends up making things more readable given this particular solution.

rangesA:

chrom   start   stop
 5       100     105
 1       200     250
 9       275     300

rangesB:

chrom    start    stop
  1       200      265
  5       99       106
  9       275      290

for each row in rangesA:

# Elementwise `&` tests every row of rangesB; `&&` is not vectorised
matches <- which((rangesB[,'chrom'] == rangesA[row,'chrom']) &
                 (rangesB[,'start'] <= rangesA[row, 'start']) &
                 (rangesB[,'stop'] >= rangesA[row, 'stop']))

I figure there's got to be a better (and by better, I mean faster over large instances of rangesA and rangesB) way to do this than looping over this construct. Any ideas?
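For anyone wanting to reproduce the setup, here is a minimal sketch that builds the two example data frames and runs the per-row test over all rows of rangesA. The use of `lapply` and the `$` column access are choices made for this illustration, not part of the original loop. The conditions are combined with the elementwise `&`, since `&&` only looks at a single element (and is an error for longer vectors in R 4.3.0 and later).

```r
# Example data copied from the tables above
rangesA <- data.frame(chrom = c(5, 1, 9),
                      start = c(100, 200, 275),
                      stop  = c(105, 250, 300))
rangesB <- data.frame(chrom = c(1, 5, 9),
                      start = c(200, 99, 275),
                      stop  = c(265, 106, 290))

# For each row of rangesA, the indices of the rangesB rows that contain it
matches <- lapply(seq_len(nrow(rangesA)), function(row) {
  which(rangesB$chrom == rangesA$chrom[row] &
        rangesB$start <= rangesA$start[row] &
        rangesB$stop  >= rangesA$stop[row])
})

matches  # list(2, 1, integer(0)): row 3 of rangesA has no containing range
```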

asked Oct 12 '10 15:10

geoffjentry

2 Answers

Use the IRanges/GenomicRanges packages from Bioconductor, which are made for dealing with exactly these problems (and scale massively).

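# biocLite() has been retired; current Bioconductor releases install via BiocManager
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("IRanges")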

There are a few appropriate containers for ranges on different chromosomes; one is RangesList.

library(IRanges)
rangesA <- split(IRanges(rangesA$start, rangesA$stop), rangesA$chrom)
rangesB <- split(IRanges(rangesB$start, rangesB$stop), rangesB$chrom)
# type="within" requires the query (first argument) to lie wholly inside the subject
# which rangesA are wholly contained in at least one rangesB?
ov <- countOverlaps(rangesA, rangesB, type="within") > 0

answered Oct 12 '22 23:10

Aaron Statham

This would be a lot easier / faster if you can merge the two objects first.

ranges <- merge(rangesA,rangesB,by="chrom",suffixes=c("A","B"))
ranges[with(ranges, startB <= startA & stopB >= stopA),]
#  chrom startA stopA startB stopB
#1     1    200   250    200   265
#2     5    100   105     99   106

answered Oct 13 '22 00:10

Joshua Ulrich
