I have data of patient prescription of oral DM drugs, i.e. DPP4 and SU, and would like to find out if patients had taken the drugs concurrently (i.e. whether there are overlapping intervals for DPP4 and SU within the same patient <code>ID</code>). Sample data: <pre class="prettyprint"><code> ID DRUG START END 1 1 DPP4 2020-01-01 2020-01-20 2 1 DPP4 2020-03-01 2020-04-01 3 1 SU 2020-03-15 2020-04-30 4 2 SU 2020-10-01 2020-10-31 5 2 DPP4 2020-12-01 2020-12-31 </code></pre> In the sample data above, <ol> <li> <code>ID == 1</code>, patient had DPP4 and SU concurrently from <code>2020-03-15</code> to <code>2020-04-01</code>.</li> <li> <code>ID == 2</code>, patient had consumed both medications at separate intervals.</li> </ol> I thought of splitting the data into 2, one for DPP4 and another for SU. Then, do a full join, and compare each DPP4 interval with each SU interval. This may be okay for small data, but if a patient has like 5 rows for DPP4 and another 5 for SU, we will have 25 comparisons, which may not be efficient. Add that with 10000+ patients. I am not sure how to do it. New data: Hope to have a new df that looks like this. Or anything that is tidy. <pre class="prettyprint"><code> ID DRUG START END 1 1 DPP4-SU 2020-03-15 2020-04-01 2 2 <NA> <NA> <NA> </code></pre> Data Code: <pre class="prettyprint"><code>df <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L), DRUG = c("DPP4", "DPP4", "SU", "SU", "DPP4"), START = structure(c(18262, 18322, 18336, 18536, 18597), class = "Date"), END = structure(c(18281, 18353, 18382, 18566, 18627), class = "Date")), class = "data.frame", row.names = c(NA, -5L)) df_new <- structure(list(ID = 1:2, DRUG = c("DPP4-SU", NA), START = structure(c(18336, NA), class = "Date"), END = structure(c(18353, NA), class = "Date")), class = "data.frame", row.names = c(NA, -2L)) </code></pre> <hr> Edit: I think from the sample data I gave, it may seem that there can only be 1 intersecting interval. But there may be more. So, I think this would be better data to illustrate. <pre class="prettyprint"><code>structure(list(ID = c(3, 3, 3, 3, 3, 3, 3), DRUG = c("DPP4", "DPP4", "SU", "SU", "DPP4", "DPP4", "DPP4"), START = structure(c(17004, 17383, 17383, 17418, 17437, 17649, 17676), class = c("IDate", "Date")), END = structure(c(17039, 17405, 17405, 17521, 17625, 17669, 17711), class = c("IDate", "Date")), duration = c(35L, 22L, 22L, 103L, 188L, 20L, 35L), INDEX = c(1L, 0L, 0L, 0L, 0L, 0L, 0L)), row.names = c(NA, -7L), class = c("tbl_df", "tbl", "data.frame")) </code></pre>

It's way more complicated than dear @AnoushiravanR's but as an alternative you could try <pre class="prettyprint lang-r prettyprint-override"><code>library(dplyr) library(tidyr) library(lubridate) df %>% full_join(x = ., y = ., by = "ID") %>% # filter(DRUG.x != DRUG.y | START.x != START.y | END.x != END.y) %>% filter(DRUG.x != DRUG.y) %>% group_by(ID, intersection = intersect(interval(START.x, END.x), interval(START.y, END.y))) %>% drop_na(intersection) %>% filter(START.x == first(START.x)) %>% summarise(DRUG = paste(DRUG.x, DRUG.y, sep = "-"), START = as_date(int_start(intersection)), END = as_date(int_end(intersection)), .groups = "drop") %>% select(-intersection) </code></pre> returning <pre class="prettyprint lang-r prettyprint-override"><code># A tibble: 1 x 4 ID DRUG START END <int> <chr> <date> <date> 1 1 DPP4-SU 2020-03-15 2020-04-01 </code></pre> Edit: Changed the filter condition. The former one was flawed.

Updated Solution I have made considerable modifications based on the newly provided data set. This time I first created interval for each <code>START</code> and <code>END</code> pair and extract the intersecting period between them. As dear Martin nicely made use of them we could use <code>lubridate::int_start</code> and <code>lubridate::int_end</code> to extract the <code>START</code> and <code>END</code> date of each interval: <pre class="prettyprint"><code>library(dplyr) library(lubridate) library(purrr) library(tidyr) df %>% group_by(ID) %>% arrange(START, END) %>% mutate(int = interval(START, END), is_over = c(NA, map2(int[-n()], int[-1], ~ intersect(.x, .y)))) %>% unnest(cols = c(is_over)) %>% select(-int) %>% filter(!is.na(is_over) | !is.na(lead(is_over))) %>% select(!c(START, END)) %>% mutate(grp = cumsum(is.na(is_over))) %>% group_by(grp) %>% summarise(ID = first(ID), DRUG = paste0(DRUG, collapse = "-"), is_over = na.omit(is_over)) %>% mutate(START = int_start(is_over), END = int_end(is_over)) %>% select(!is_over) # A tibble: 1 x 5 grp ID DRUG START END <int> <int> <chr> <dttm> <dttm> 1 1 1 DPP4-SU 2020-03-15 00:00:00 2020-04-01 00:00:00 </code></pre> Second data set: <pre class="prettyprint"><code># A tibble: 2 x 5 grp ID DRUG START END <int> <dbl> <chr> <dttm> <dttm> 1 1 3 DPP4-SU 2017-08-05 00:00:00 2017-08-27 00:00:00 2 2 3 SU-DPP4 2017-09-28 00:00:00 2017-12-21 00:00:00 </code></pre>

How can we check if any 2 intervals of a unique ID overlaps?

Tags:

performance

r

data-manipulation

dplyr

tidyr

I have data of patient prescription of oral DM drugs, i.e. DPP4 and SU, and would like to find out if patients had taken the drugs concurrently (i.e. whether there are overlapping intervals for DPP4 and SU within the same patient ID).

Sample data:

  ID DRUG      START        END
1  1 DPP4 2020-01-01 2020-01-20
2  1 DPP4 2020-03-01 2020-04-01
3  1   SU 2020-03-15 2020-04-30
4  2   SU 2020-10-01 2020-10-31
5  2 DPP4 2020-12-01 2020-12-31

In the sample data above,

ID == 1, patient had DPP4 and SU concurrently from 2020-03-15 to 2020-04-01.
ID == 2, patient had consumed both medications at separate intervals.

I thought of splitting the data into 2, one for DPP4 and another for SU. Then, do a full join, and compare each DPP4 interval with each SU interval. This may be okay for small data, but if a patient has like 5 rows for DPP4 and another 5 for SU, we will have 25 comparisons, which may not be efficient. Add that with 10000+ patients.

I am not sure how to do it.

New data:

Hope to have a new df that looks like this. Or anything that is tidy.

  ID    DRUG      START        END
1  1 DPP4-SU 2020-03-15 2020-04-01
2  2    <NA>       <NA>       <NA>

Data Code:

df <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L), DRUG = c("DPP4", "DPP4", 
"SU", "SU", "DPP4"), START = structure(c(18262, 18322, 18336, 
18536, 18597), class = "Date"), END = structure(c(18281, 18353, 
18382, 18566, 18627), class = "Date")), class = "data.frame", row.names = c(NA, 
-5L))

df_new <- structure(list(ID = 1:2, DRUG = c("DPP4-SU", NA), START = structure(c(18336, 
NA), class = "Date"), END = structure(c(18353, NA), class = "Date")), class = "data.frame", row.names = c(NA, 
-2L))

Edit: I think from the sample data I gave, it may seem that there can only be 1 intersecting interval. But there may be more. So, I think this would be better data to illustrate.

structure(list(ID = c(3, 3, 3, 3, 3, 3, 3), DRUG = c("DPP4", 
"DPP4", "SU", "SU", "DPP4", "DPP4", "DPP4"), START = structure(c(17004, 
17383, 17383, 17418, 17437, 17649, 17676), class = c("IDate", 
"Date")), END = structure(c(17039, 17405, 17405, 17521, 17625, 
17669, 17711), class = c("IDate", "Date")), duration = c(35L, 
22L, 22L, 103L, 188L, 20L, 35L), INDEX = c(1L, 0L, 0L, 0L, 0L, 
0L, 0L)), row.names = c(NA, -7L), class = c("tbl_df", "tbl", 
"data.frame"))

781

asked Aug 09 '21 08:08

HNSKD

3 Answers

It's way more complicated than dear @AnoushiravanR's but as an alternative you could try

library(dplyr)
library(tidyr)
library(lubridate)

df %>% 
  full_join(x = ., y = ., by = "ID") %>% 
#  filter(DRUG.x != DRUG.y | START.x != START.y | END.x != END.y) %>% 
  filter(DRUG.x != DRUG.y) %>%
  group_by(ID, intersection = intersect(interval(START.x, END.x), interval(START.y, END.y))) %>% 
  drop_na(intersection) %>% 
  filter(START.x == first(START.x)) %>% 
  summarise(DRUG  = paste(DRUG.x, DRUG.y, sep = "-"),
            START = as_date(int_start(intersection)),
            END   = as_date(int_end(intersection)),
            .groups = "drop") %>% 
  select(-intersection)

returning

# A tibble: 1 x 4
     ID DRUG    START      END       
  <int> <chr>   <date>     <date>    
1     1 DPP4-SU 2020-03-15 2020-04-01

Edit: Changed the filter condition. The former one was flawed.

184

answered Oct 25 '22 20:10

Martin Gal

Updated Solution I have made considerable modifications based on the newly provided data set. This time I first created interval for each START and END pair and extract the intersecting period between them. As dear Martin nicely made use of them we could use lubridate::int_start and lubridate::int_end to extract the START and END date of each interval:

library(dplyr)
library(lubridate)
library(purrr)
library(tidyr)

df %>%
  group_by(ID) %>%
  arrange(START, END) %>%
  mutate(int = interval(START, END),
         is_over = c(NA, map2(int[-n()], int[-1], 
                              ~ intersect(.x, .y)))) %>%
  unnest(cols = c(is_over)) %>% 
  select(-int) %>%
  filter(!is.na(is_over) | !is.na(lead(is_over))) %>%
  select(!c(START, END)) %>%
  mutate(grp = cumsum(is.na(is_over))) %>%
  group_by(grp) %>%
  summarise(ID = first(ID), 
            DRUG = paste0(DRUG, collapse = "-"), 
            is_over = na.omit(is_over)) %>%
  mutate(START = int_start(is_over), 
         END = int_end(is_over)) %>%
  select(!is_over)

# A tibble: 1 x 5
    grp    ID DRUG    START               END                
  <int> <int> <chr>   <dttm>              <dttm>             
1     1     1 DPP4-SU 2020-03-15 00:00:00 2020-04-01 00:00:00

Second data set:

# A tibble: 2 x 5
    grp    ID DRUG    START               END                
  <int> <dbl> <chr>   <dttm>              <dttm>             
1     1     3 DPP4-SU 2017-08-05 00:00:00 2017-08-27 00:00:00
2     2     3 SU-DPP4 2017-09-28 00:00:00 2017-12-21 00:00:00

answered Oct 25 '22 20:10

Anoushiravan R

Update

As per updated df

df <- structure(list(ID = c(3, 3, 3, 3, 3, 3, 3), DRUG = c(
  "DPP4",
  "DPP4", "SU", "SU", "DPP4", "DPP4", "DPP4"
), START = structure(c(
  17004,
  17383, 17383, 17418, 17437, 17649, 17676
), class = c(
  "IDate",
  "Date"
)), END = structure(c(
  17039, 17405, 17405, 17521, 17625,
  17669, 17711
), class = c("IDate", "Date")), duration = c(
  35L,
  22L, 22L, 103L, 188L, 20L, 35L
), INDEX = c(
  1L, 0L, 0L, 0L, 0L,
  0L, 0L
)), row.names = c(NA, -7L), class = c(
  "tbl_df", "tbl",
  "data.frame"
))

we obtain

> dfnew
    ID    DRUG      start        end
3.3  3 DPP4-SU 2017-08-05 2017-08-27
3.7  3 SU-DPP4 2017-09-28 2017-12-21

A base R option (not as fancy as the answers by @Anoushiravan R or @Martin Gal)

f <- function(d) {
  d <- d[with(d, order(START, END)), ]
  idx <- subset(
    data.frame(which((u <- with(d, outer(START, END, `<`))) & t(u), arr.ind = TRUE)),
    row > col
  )
  if (nrow(idx) == 0) {
    return(data.frame(ID = unique(d$ID), DRUG = NA, start = NA, end = NA))
  }
  with(
    d,
    do.call(rbind, 
    apply(
      idx,
      1,
      FUN = function(v) {
        data.frame(
          ID = ID[v["row"]],
          DRUG = paste0(DRUG[sort(unlist(v))], collapse = "-"),
          start = START[v["row"]],
          end = END[v["col"]]
        )
      }
    ))
  )
}

dfnew <- do.call(rbind, Map(f, split(df, ~ID)))

gives

> dfnew
  ID    DRUG      start        end
1  1 DPP4-SU 2020-03-15 2020-04-01
2  2    <NA>       <NA>       <NA>

answered Oct 25 '22 20:10

ThomasIsCoding

Related questions
                            
                                Adding a legend to an rgl 3d plot
                            
                                Efficient linked list (ordered set) in R
                            
                                R stargazer, lme4 and lmerTest incompatibility
                            
                                Rmarkdown removes citation hyperlink
                            
                                Fill different colors for each quantile in geom_density() of ggplot [duplicate]
                            
                                Breaking out of nested loops in R
                            
                                Disabling buttons in Shiny
                            
                                R: What do you call the :: and ::: operators and how do they differ?
                            
                                Apply a dataframe of boolean values onto another dataframe in R
                            
                                R script in Power BI returns date as Microsoft.OleDb.Date
                            
                                using purrr to affect single columns of each dataframe in a list
                            
                                how to change the color in geom_point or lines in ggplot [duplicate]
                            
                                How to use the Google satellite view as tile in leaflet with R
                            
                                Saving ggplot graph to PDF with fonts embedded in r
                            
                                Rscript detect if R script is being called/sourced from another script
                            
                                Merge and Perfectly Align Histogram and Boxplot using ggplot2
                            
                                How to download entire repository from Github using R?
                            
                                dplyr rowwise sum and other functions like max
                            
                                Summing values in R based on column value with dplyr
                            
                                Reading shape file with sf::st_read fails to capture encoding UTF8

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

How can we check if any 2 intervals of a unique ID overlaps?

Tags:

performance

r

data-manipulation

dplyr

tidyr

HNSKD

People also ask

3 Answers

Martin Gal

Anoushiravan R

Update

ThomasIsCoding

Recent Activity

Donate For Us