I am trying to work on genomic data with R, and I have seen a couple of topics with quite good answers about overlapping intervals across two dataframes. My problem is that I have a single dataframe containing overlapping intervals, which I would like to merge, for example:
chrom start stop
5 100 105
5 100 105
5 200 300
9 275 300
9 280 301
I would like to end up with something like this:
chrom start stop
5 100 105
5 200 300
9 275 301
I am also trying to become better at coding, so I was wondering what would be the most elegant way to do it. I hope this is not redundant with some other question.
Use int_overlaps() from the lubridate package to check whether two intervals overlap. It returns TRUE if the intervals overlap and FALSE otherwise.
A simple approach is to take the first interval and compare it with every other interval. If it overlaps with another interval, merge that interval into the first one and remove it from the list. Then repeat the same steps for the remaining intervals.
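The pairwise approach can be sketched in base R as follows. This is only an illustration under a few assumptions: the column names chrom, start and stop come from the question, intervals are treated as closed, and merge_naive() is a hypothetical helper name, not a function from any package. It is quadratic in the number of rows, so it is meant to show the idea rather than to be used on large genomic data.

```r
# Example data from the question.
df <- data.frame(
  chrom = c(5, 5, 5, 9, 9),
  start = c(100, 100, 200, 275, 280),
  stop  = c(105, 105, 300, 300, 301)
)

# merge_naive() is a hypothetical helper, not part of any package.
merge_naive <- function(df) {
  merged <- df[0, ]                 # empty data frame with the same columns
  while (nrow(df) > 0) {
    cur  <- df[1, ]                 # interval we are currently growing
    rest <- df[-1, ]                # all other intervals
    repeat {
      # Closed intervals on the same chromosome that share at least one position.
      hit <- rest$chrom == cur$chrom &
        rest$start <= cur$stop & cur$start <= rest$stop
      if (!any(hit)) break
      # Absorb the overlapping intervals, then check again, because the
      # grown interval may now reach intervals it did not touch before.
      cur$start <- min(cur$start, rest$start[hit])
      cur$stop  <- max(cur$stop, rest$stop[hit])
      rest <- rest[!hit, ]
    }
    merged <- rbind(merged, cur)
    df <- rest
  }
  rownames(merged) <- NULL
  merged
}

merged_naive <- merge_naive(df)
print(merged_naive)
```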
1) Sort all intervals in increasing order of start time. This step takes O(n log n) time.
2) In the sorted array, if the end time of an interval is not more than the end of the previous interval, then there is a complete overlap. This step takes O(n) time.
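The same sort-then-scan idea can be adapted to merging rather than just detecting overlaps. The sketch below is a variation, not the exact check described above: after sorting by chromosome and start, a new merged interval begins whenever a row starts beyond the largest stop seen so far on that chromosome. It assumes at least one row, closed intervals, and the column names from the question; merge_intervals() is a hypothetical helper name.

```r
# Example data from the question.
df <- data.frame(
  chrom = c(5, 5, 5, 9, 9),
  start = c(100, 100, 200, 275, 280),
  stop  = c(105, 105, 300, 300, 301)
)

# merge_intervals() is a hypothetical helper, not part of any package.
merge_intervals <- function(df) {
  # Step 1: sort by chromosome, then by start coordinate.
  df <- df[order(df$chrom, df$start, df$stop), ]

  # Step 2: single pass. Running maximum of stop within each chromosome.
  run_max  <- ave(df$stop, df$chrom, FUN = cummax)
  prev_max <- c(NA, head(run_max, -1))

  # A row opens a new group if the chromosome changes or if it starts
  # after everything seen so far on this chromosome has ended.
  new_chrom <- c(TRUE, df$chrom[-1] != df$chrom[-nrow(df)])
  new_group <- new_chrom | df$start > prev_max
  group <- cumsum(new_group)

  # Collapse each group to one interval.
  data.frame(
    chrom = as.vector(tapply(df$chrom, group, function(x) x[1])),
    start = as.vector(tapply(df$start, group, min)),
    stop  = as.vector(tapply(df$stop, group, max))
  )
}

merged <- merge_intervals(df)
print(merged)
```

With this rule, intervals that share an endpoint are merged, while intervals that are merely adjacent (one starts at the position right after the other stops) are kept separate.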
If two ranges have at least one point in common, we say that they overlap. In other words, two closed ranges [a1, a2] and [b1, b2] overlap if a1 <= b2 and b1 <= a2. Non-overlapping ranges, on the other hand, have no points in common.
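For closed ranges, the usual way to test for a common point is to check that each range starts no later than the other one ends. A minimal base R sketch, assuming start <= stop for both ranges (ranges_overlap() is a hypothetical helper name):

```r
# ranges_overlap() is a hypothetical helper, not part of any package.
# Both ranges are treated as closed and are assumed to satisfy start <= stop.
ranges_overlap <- function(start1, stop1, start2, stop2) {
  start1 <= stop2 & start2 <= stop1
}

# Two of the chromosome 9 rows from the question share positions 280 to 300.
overlap_chr9 <- ranges_overlap(275, 300, 280, 301)

# The two distinct chromosome 5 intervals have no position in common.
overlap_chr5 <- ranges_overlap(100, 105, 200, 300)

print(c(overlap_chr9, overlap_chr5))
```

Because the comparison uses the vectorised `&`, the function also accepts whole columns of starts and stops and returns one logical value per pair.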
Using GenomicRanges::reduce:
require(GenomicRanges)
as.data.frame(reduce(GRanges(df$chrom, IRanges(df$start, df$stop))))
# seqnames start end width strand
# 1 5 100 105 6 *
# 2 5 200 300 101 *
# 3 9 275 301 27 *
It's also possible using data.table::foverlaps or GenomicRanges::findOverlaps, but not as straightforward.