
Overlap ranges in single dataframe

Tags:

r

I want to flag the rows in my data frame whose range overlaps the range of another row, i.e. create the Overlap column shown below. Each range is defined by two numeric variables (Min, Max), which I could convert to integer if necessary:

Class    Min  Max
    A    100  200
    A    120  205
    A    210  310
    A    500  630
    A    510  530
    A    705  800

Transform into:

Class    Min  Max  Overlap
    A    100  200        1
    A    120  205        1
    A    210  310        0
    A    500  630        1
    A    510  530        1
    A    705  800        0

I have tried IRanges without much success - any ideas?

user5316628 asked Oct 19 '16 10:10


2 Answers

I find data.table very effective for overlap joins, using its foverlaps function.

library(data.table)

Recreating the data:

dt <- data.table(Class = c("A", "A", "A", "A", "A", "A"),
           Min = c(100, 120, 210, 500, 510, 705),
           Max = c(200, 205, 310, 630, 530, 800))

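Key the data.table on the interval columns. foverlaps requires a key on the table passed as its second argument, with the interval start and end as the last two key columns: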

setkey(dt, Min, Max)

Here we run foverlaps on the table against itself, then filter out the rows where an interval is matched with itself. The remaining matches are counted, grouped by Min and Max. The filter compares both endpoints together, so two different intervals that merely share a Min or a Max still count as overlapping:

dt_overlaps <- foverlaps(dt, dt, type = "any")[!(Min == i.Min & Max == i.Max), .(Class, Overlap = .N), by = c("Min", "Max")]

Then flag the overlapping rows with an update join (thanks to DavidArenburg):

dt[dt_overlaps, Overlap := 1]

Results (rows with no overlap are left as NA rather than 0):

> dt
  Class Min Max Overlap
1     A 100 200       1
2     A 120 205       1
3     A 210 310      NA
4     A 500 630       1
5     A 510 530       1
6     A 705 800      NA

There is probably neater data.table code for this, but I'm learning as well.
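As a rough sketch of one way to tidy this up, the variant below adds a row id so that self-matches can be dropped by id rather than by comparing endpoints, and fills in 0 instead of NA. The id column and the hits table are names made up for this illustration, not part of the original answer, and like the code above it ignores Class.

```r
library(data.table)

dt <- data.table(Class = c("A", "A", "A", "A", "A", "A"),
                 Min = c(100, 120, 210, 500, 510, 705),
                 Max = c(200, 205, 310, 630, 530, 800))

# Hypothetical helper column: a row id, so self-matches can be dropped
# even when two rows carry identical intervals.
dt[, id := .I]
setkey(dt, Min, Max)

# Self overlap join; columns from the first argument get an "i." prefix.
hits <- foverlaps(dt, dt, type = "any")[id != i.id]

# 1 if the row overlaps at least one other row, otherwise 0.
dt[, Overlap := as.integer(id %in% hits$id)]
dt[, id := NULL]

print(dt)
```

With the sample data this should flag the first two rows and the 500-630 / 510-530 pair, and leave the other two rows at 0.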

zacdav answered Oct 05 '22 22:10


outer is my function of choice for doing pairwise comparisons fast. You can compare the interval endpoints pairwise using outer and then combine the comparisons in any way you want. In this case I check that the two conditions required for an overlap hold simultaneously: interval i overlaps interval j when the Max of i is greater than the Min of j and the Min of i is less than the Max of j.

library(dplyr)

df_foo = read.table(
textConnection("Class    Min  Max
A    100  200
A    120  205
A    210  310
A    500  630
A    510  530
A    705  800"), header = TRUE
)

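# ends_after[i, j] is TRUE when interval i ends after interval j starts
ends_after = outer(df_foo$Max, df_foo$Min, ">")
# starts_before[i, j] is TRUE when interval i starts before interval j ends
starts_before = outer(df_foo$Min, df_foo$Max, "<")

# Every interval overlaps itself, so a row sum above 1 means it overlaps another row
df_foo %>%
  mutate(Overlap = as.integer(rowSums(ends_after & starts_before) > 1))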
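
If the flag should only consider rows within the same Class, which is an assumption about the intent of the question since the sample data has a single class, the same outer idea can be wrapped in a small function and applied per group in base R. The sketch below is only illustrative; flag_overlaps is a hypothetical helper name, not a function from any package.

```r
# Hypothetical helper: flag intervals that overlap at least one other interval.
flag_overlaps <- function(lo, hi) {
  # hits[i, j] is TRUE when interval i and interval j overlap
  # (strict inequalities, so intervals that only touch do not count).
  hits <- outer(hi, lo, ">") & outer(lo, hi, "<")
  # Ignore each interval's overlap with itself.
  diag(hits) <- FALSE
  as.integer(rowSums(hits) > 0)
}

df_foo <- data.frame(
  Class = "A",
  Min = c(100, 120, 210, 500, 510, 705),
  Max = c(200, 205, 310, 630, 530, 800)
)

# Assumption: overlaps are only meaningful within a Class, so split by Class,
# flag each group separately, and put the flags back in the original row order.
df_foo$Overlap <- unsplit(
  lapply(split(df_foo, df_foo$Class), function(g) flag_overlaps(g$Min, g$Max)),
  df_foo$Class
)

print(df_foo)
```

Note that outer builds an n-by-n matrix per group, so this is best suited to groups of moderate size.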
tchakravarty answered Oct 05 '22 23:10