
Overlapping intervals in a dataframe in R

Tags: dataframe, r

I am working with genomic data in R. I have seen several threads with good answers about finding overlapping intervals between two dataframes. My problem is different: I have a single dataframe that contains overlapping intervals, and I would like to merge them. For example:

chrom   start   stop
 5       100     105
 5       100     105
 5       200     300
 9       275     300
 9       280     301

I would like to end up with something like this:

chrom   start   stop
 5       100     105
 5       200     300
 9       275     301

I am also trying to become a better coder, so I am wondering what the most elegant way to do this would be. I hope this does not duplicate an existing question.

Max_IT asked Oct 28 '15 16:10





1 Answer

Using GenomicRanges::reduce:

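library(GenomicRanges)  # Bioconductor package
# df is the dataframe from the question (columns chrom, start, stop)
gr <- GRanges(df$chrom, IRanges(df$start, df$stop))
# reduce() merges overlapping (and, by default, adjacent) ranges per chromosome
as.data.frame(reduce(gr))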
#   seqnames start end width strand
# 1        5   100 105     6      *
# 2        5   200 300   101      *
# 3        9   275 301    27      *

It's also possible using data.table::foverlaps or GenomicRanges::findOverlaps, but not as straightforward.
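
If installing a Bioconductor package is not an option, the same merge can be sketched in base R with the usual sort-and-sweep idea: sort by chromosome and start, then begin a new merged interval whenever a start lies beyond every stop seen so far on that chromosome. This is only an illustration, not part of the answer above; `merge_intervals` is a hypothetical helper name, and it assumes a non-empty dataframe with the columns `chrom`, `start` and `stop` from the question.

```r
# Hypothetical helper (not from GenomicRanges or data.table):
# merge overlapping intervals within each chromosome using base R only.
merge_intervals <- function(df) {
  # Sort so that intervals on the same chromosome are in start order
  df <- df[order(df$chrom, df$start, df$stop), ]
  pieces <- lapply(split(df, df$chrom, drop = TRUE), function(d) {
    # Largest stop among all earlier rows (-Inf for the first row)
    prev_max <- c(-Inf, head(cummax(d$stop), -1))
    # A new merged interval begins where start exceeds every earlier stop
    grp <- cumsum(d$start > prev_max)
    data.frame(
      chrom = d$chrom[1],
      start = as.vector(tapply(d$start, grp, min)),
      stop  = as.vector(tapply(d$stop, grp, max))
    )
  })
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

# Example data copied from the question
df <- data.frame(
  chrom = c(5, 5, 5, 9, 9),
  start = c(100, 100, 200, 275, 280),
  stop  = c(105, 105, 300, 300, 301)
)
merged <- merge_intervals(df)
print(merged)
```

One difference to keep in mind: this sketch merges only intervals that share at least one position, whereas reduce() with its default settings also merges directly adjacent integer ranges (for example 100-105 and 106-110). Changing the comparison to `d$start > prev_max + 1` should reproduce that behaviour for integer coordinates.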

Arun answered Nov 03 '22 04:11
