Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Select first row in each contiguous run by group

I have data which is grouped by 'ID'. Each 'ID' has different drugs at different dates. Within each consecutive run of 'drug', I would like to keep only the first row. This should be done by group, i.e. within each 'ID'. Two examples are shown in the data:

ID        date    drug  
 1  01/01/2020       A # first row in run 1 of 'A' for ID 1: keep 
 1  07/01/2020       A # 2nd row in run 1 of 'A' for ID 1: drop
 1  09/01/2020       B
 1  15/01/2020       A
 2  01/02/2020       C 
 2  13/02/2020       D
 2  17/02/2020       C # first row in run 2 of 'C' of ID 2: keep 
 2  18/03/2020       C # 2nd row in run 2 of 'C' of ID 2: drop 
 2  19/03/2020       E

Desired output:

ID     date             drug  
1      01/01/2020        A
1      09/01/2020        B
1      15/01/2020        A
2      01/02/2020        C
2      13/02/2020        D
2      17/02/2020        C
2      19/03/2020        E

I have tried the following but I cannot make it work since it'll remove those drugs that are from the same group but appear later on e.g. it'd drop 15/01/2020, 17/02/2020 and 18/03/2020 since it only takes the first observation by group.

df_selection <- df %>%   
  group_by(ID) %>% 
  arrange(ID,date) %>% 
  group_by(ID, drug) %>% 
  slice(1L) %>% 
  arrange(ID,date)

I have tried many combinations but I cannot make it work. I'd really appreciate some help!


An additional example to demonstrate a case where the last 'drug' in one 'ID' is the same as the first in the next 'ID', here drug 'B':

ID       date drug
 1 01/01/2020    A
 1 07/01/2020    A
 1 09/01/2020    B # first row in a run of 'B' for ID 1: keep 
 1 15/01/2020    B # 2nd row in a run of 'B' for ID 1: drop 
 2 01/02/2020    B # first row in a run of 'B' for ID 2: keep 
 2 13/02/2020    B # 2nd: drop
 2 17/02/2020    B # 3rd: drop
 2 18/03/2020    E
 2 19/03/2020    E
like image 919
adrian1121 Avatar asked Nov 17 '25 09:11

adrian1121


2 Answers

df %>% filter(drug != lag(drug, default = ""))

Or, if you want to keep first appearance of a drug for one ID even if it matches the last drug for the prior ID (e.g. let's say ID2's first drug was A and we therefore wanted to keep it.):

df %>%
  filter(drug != lag(drug, default = "") |
           ID != lag(ID, default = 0))
like image 180
Jon Spring Avatar answered Nov 19 '25 03:11

Jon Spring


Using data.table:

setDT(df)[rowid(rleid(drug)) == 1]
#    ID       date drug
# 1:  1 01/01/2020    A
# 2:  1 09/01/2020    B
# 3:  1 15/01/2020    A
# 4:  2 01/02/2020    C
# 5:  2 13/02/2020    D
# 6:  2 17/02/2020    C
# 7:  2 19/03/2020    E

If the runs of 'drug' should be considered within each 'ID' we need...

df[rowid(rleid(ID, drug)) == 1]

...to handle cases like:

   ID       date drug
1:  1 01/01/2020    A
2:  1 07/01/2020    A
3:  1 09/01/2020    B
4:  1 15/01/2020    B # This 'B' belongs to 2nd run in ID 1 
5:  2 01/02/2020    B # This 'B' belongs to 1st run in ID 2
6:  2 13/02/2020    B
7:  2 17/02/2020    B
8:  2 18/03/2020    E
9:  2 19/03/2020    E
like image 40
Henrik Avatar answered Nov 19 '25 03:11

Henrik