Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How can we check if any 2 intervals of a unique ID overlaps?

I have data of patient prescription of oral DM drugs, i.e. DPP4 and SU, and would like to find out if patients had taken the drugs concurrently (i.e. whether there are overlapping intervals for DPP4 and SU within the same patient ID).

Sample data:

  ID DRUG      START        END
1  1 DPP4 2020-01-01 2020-01-20
2  1 DPP4 2020-03-01 2020-04-01
3  1   SU 2020-03-15 2020-04-30
4  2   SU 2020-10-01 2020-10-31
5  2 DPP4 2020-12-01 2020-12-31

In the sample data above,

  1. ID == 1, patient had DPP4 and SU concurrently from 2020-03-15 to 2020-04-01.
  2. ID == 2, patient had consumed both medications at separate intervals.

I thought of splitting the data into 2, one for DPP4 and another for SU. Then, do a full join, and compare each DPP4 interval with each SU interval. This may be okay for small data, but if a patient has like 5 rows for DPP4 and another 5 for SU, we will have 25 comparisons, which may not be efficient. Add that with 10000+ patients.

I am not sure how to do it.

New data:

Hope to have a new df that looks like this. Or anything that is tidy.

  ID    DRUG      START        END
1  1 DPP4-SU 2020-03-15 2020-04-01
2  2    <NA>       <NA>       <NA>

Data Code:

df <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L), DRUG = c("DPP4", "DPP4", 
"SU", "SU", "DPP4"), START = structure(c(18262, 18322, 18336, 
18536, 18597), class = "Date"), END = structure(c(18281, 18353, 
18382, 18566, 18627), class = "Date")), class = "data.frame", row.names = c(NA, 
-5L))

df_new <- structure(list(ID = 1:2, DRUG = c("DPP4-SU", NA), START = structure(c(18336, 
NA), class = "Date"), END = structure(c(18353, NA), class = "Date")), class = "data.frame", row.names = c(NA, 
-2L))

Edit: I think from the sample data I gave, it may seem that there can only be 1 intersecting interval. But there may be more. So, I think this would be better data to illustrate.

structure(list(ID = c(3, 3, 3, 3, 3, 3, 3), DRUG = c("DPP4", 
"DPP4", "SU", "SU", "DPP4", "DPP4", "DPP4"), START = structure(c(17004, 
17383, 17383, 17418, 17437, 17649, 17676), class = c("IDate", 
"Date")), END = structure(c(17039, 17405, 17405, 17521, 17625, 
17669, 17711), class = c("IDate", "Date")), duration = c(35L, 
22L, 22L, 103L, 188L, 20L, 35L), INDEX = c(1L, 0L, 0L, 0L, 0L, 
0L, 0L)), row.names = c(NA, -7L), class = c("tbl_df", "tbl", 
"data.frame"))
like image 781
HNSKD Avatar asked Aug 09 '21 08:08

HNSKD


People also ask

How do you know if two intervals are overlapping?

1) Sort all intervals in increasing order of start time. This step takes O(nLogn) time. 2) In the sorted array, if start time of an interval is less than end of previous interval, then there is an overlap.

What are overlapping intervals?

Let's take the following overlapping intervals example to explain the idea: If both ranges have at least one common point, then we say that they're overlapping. In other words, we say that two ranges and are overlapping if: On the other hand, non-overlapping ranges don't have any points in common.

How do you calculate overlapping time intervals in Java?

Java Program to find overlapping intervals among a given set of intervals. In this approach, we are going to take a count array and for each given interval start time we will do count[startTime]++ and for end time, do count[endTime]--. sum > 2 it means there is overlapping intervals.

What is the best way to find the overlap between intervals?

A Simple Solution is to consider every pair of intervals and check if the pair overlaps or not. A better solution is to Use Sorting. Following is complete algorithm. 1) Sort all intervals in increasing order of start time.

How to check if any two intervals intersect among a given set?

Check if any two intervals intersects among a given set of intervals 1 Sort all intervals in increasing order of start time. This step takes O (nLogn) time. 2 In the sorted array, if start time of an interval is less than end of previous interval, then there is an overlap. More ...

What is the time complexity of a simple interval finder?

A Simple Solution is to consider every pair of intervals and check if the pair overlaps or not. The time complexity of this solution is O (n 2 ). A better solution is to use Sorting. Following is the complete algorithm. 1) Sort all intervals in increasing order of start time.

What does the problem statement ask to check if intervals overlap?

The problem statement asks to check if any of the intervals overlap each other if overlaps then print ‘YES’ else print ‘NO’. Because, the time interval (2, 3) and (1, 4) overlaps with each other. The image clearly depicts that intervals (2,3) and (1,4) overlap. Because none of the intervals overlaps each other.


3 Answers

It's way more complicated than dear @AnoushiravanR's but as an alternative you could try

library(dplyr)
library(tidyr)
library(lubridate)

df %>% 
  full_join(x = ., y = ., by = "ID") %>% 
#  filter(DRUG.x != DRUG.y | START.x != START.y | END.x != END.y) %>% 
  filter(DRUG.x != DRUG.y) %>%
  group_by(ID, intersection = intersect(interval(START.x, END.x), interval(START.y, END.y))) %>% 
  drop_na(intersection) %>% 
  filter(START.x == first(START.x)) %>% 
  summarise(DRUG  = paste(DRUG.x, DRUG.y, sep = "-"),
            START = as_date(int_start(intersection)),
            END   = as_date(int_end(intersection)),
            .groups = "drop") %>% 
  select(-intersection)

returning

# A tibble: 1 x 4
     ID DRUG    START      END       
  <int> <chr>   <date>     <date>    
1     1 DPP4-SU 2020-03-15 2020-04-01

Edit: Changed the filter condition. The former one was flawed.

like image 184
Martin Gal Avatar answered Oct 25 '22 20:10

Martin Gal


Updated Solution I have made considerable modifications based on the newly provided data set. This time I first created interval for each START and END pair and extract the intersecting period between them. As dear Martin nicely made use of them we could use lubridate::int_start and lubridate::int_end to extract the START and END date of each interval:

library(dplyr)
library(lubridate)
library(purrr)
library(tidyr)

df %>%
  group_by(ID) %>%
  arrange(START, END) %>%
  mutate(int = interval(START, END),
         is_over = c(NA, map2(int[-n()], int[-1], 
                              ~ intersect(.x, .y)))) %>%
  unnest(cols = c(is_over)) %>% 
  select(-int) %>%
  filter(!is.na(is_over) | !is.na(lead(is_over))) %>%
  select(!c(START, END)) %>%
  mutate(grp = cumsum(is.na(is_over))) %>%
  group_by(grp) %>%
  summarise(ID = first(ID), 
            DRUG = paste0(DRUG, collapse = "-"), 
            is_over = na.omit(is_over)) %>%
  mutate(START = int_start(is_over), 
         END = int_end(is_over)) %>%
  select(!is_over)

# A tibble: 1 x 5
    grp    ID DRUG    START               END                
  <int> <int> <chr>   <dttm>              <dttm>             
1     1     1 DPP4-SU 2020-03-15 00:00:00 2020-04-01 00:00:00

Second data set:

# A tibble: 2 x 5
    grp    ID DRUG    START               END                
  <int> <dbl> <chr>   <dttm>              <dttm>             
1     1     3 DPP4-SU 2017-08-05 00:00:00 2017-08-27 00:00:00
2     2     3 SU-DPP4 2017-09-28 00:00:00 2017-12-21 00:00:00
like image 41
Anoushiravan R Avatar answered Oct 25 '22 20:10

Anoushiravan R


Update

As per updated df

df <- structure(list(ID = c(3, 3, 3, 3, 3, 3, 3), DRUG = c(
  "DPP4",
  "DPP4", "SU", "SU", "DPP4", "DPP4", "DPP4"
), START = structure(c(
  17004,
  17383, 17383, 17418, 17437, 17649, 17676
), class = c(
  "IDate",
  "Date"
)), END = structure(c(
  17039, 17405, 17405, 17521, 17625,
  17669, 17711
), class = c("IDate", "Date")), duration = c(
  35L,
  22L, 22L, 103L, 188L, 20L, 35L
), INDEX = c(
  1L, 0L, 0L, 0L, 0L,
  0L, 0L
)), row.names = c(NA, -7L), class = c(
  "tbl_df", "tbl",
  "data.frame"
))

we obtain

> dfnew
    ID    DRUG      start        end
3.3  3 DPP4-SU 2017-08-05 2017-08-27
3.7  3 SU-DPP4 2017-09-28 2017-12-21

A base R option (not as fancy as the answers by @Anoushiravan R or @Martin Gal)

f <- function(d) {
  d <- d[with(d, order(START, END)), ]
  idx <- subset(
    data.frame(which((u <- with(d, outer(START, END, `<`))) & t(u), arr.ind = TRUE)),
    row > col
  )
  if (nrow(idx) == 0) {
    return(data.frame(ID = unique(d$ID), DRUG = NA, start = NA, end = NA))
  }
  with(
    d,
    do.call(rbind, 
    apply(
      idx,
      1,
      FUN = function(v) {
        data.frame(
          ID = ID[v["row"]],
          DRUG = paste0(DRUG[sort(unlist(v))], collapse = "-"),
          start = START[v["row"]],
          end = END[v["col"]]
        )
      }
    ))
  )
}

dfnew <- do.call(rbind, Map(f, split(df, ~ID)))

gives

> dfnew
  ID    DRUG      start        end
1  1 DPP4-SU 2020-03-15 2020-04-01
2  2    <NA>       <NA>       <NA>
like image 5
ThomasIsCoding Avatar answered Oct 25 '22 20:10

ThomasIsCoding