I have data which is grouped by 'ID'. Each 'ID' has different drugs at different dates. Within each consecutive run of 'drug', I would like to keep only the first row. This should be done by group, i.e. within each 'ID'. Two examples are shown in the data:
ID date drug
1 01/01/2020 A # first row in run 1 of 'A' for ID 1: keep
1 07/01/2020 A # 2nd row in run 1 of 'A' for ID 1: drop
1 09/01/2020 B
1 15/01/2020 A
2 01/02/2020 C
2 13/02/2020 D
2 17/02/2020 C # first row in run 2 of 'C' of ID 2: keep
2 18/03/2020 C # 2nd row in run 2 of 'C' of ID 2: drop
2 19/03/2020 E
Desired output:
ID date drug
1 01/01/2020 A
1 09/01/2020 B
1 15/01/2020 A
2 01/02/2020 C
2 13/02/2020 D
2 17/02/2020 C
2 19/03/2020 E
I have tried the following but I cannot make it work since it'll remove those drugs that are from the same group but appear later on e.g. it'd drop 15/01/2020, 17/02/2020 and 18/03/2020 since it only takes the first observation by group.
df_selection <- df %>%
group_by(ID) %>%
arrange(ID,date) %>%
group_by(ID, drug) %>%
slice(1L) %>%
arrange(ID,date)
I have tried many combinations but I cannot make it work. I'd really appreciate some help!
An additional example to demonstrate a case where the last 'drug' in one 'ID' is the same as the first in the next 'ID', here drug 'B':
ID date drug
1 01/01/2020 A
1 07/01/2020 A
1 09/01/2020 B # first row in a run of 'B' for ID 1: keep
1 15/01/2020 B # 2nd row in a run of 'B' for ID 1: drop
2 01/02/2020 B # first row in a run of 'B' for ID 2: keep
2 13/02/2020 B # 2nd: drop
2 17/02/2020 B # 3rd: drop
2 18/03/2020 E
2 19/03/2020 E
df %>% filter(drug != lag(drug, default = ""))
Or, if you want to keep first appearance of a drug for one ID even if it matches the last drug for the prior ID (e.g. let's say ID2's first drug was A and we therefore wanted to keep it.):
df %>%
filter(drug != lag(drug, default = "") |
ID != lag(ID, default = 0))
Using data.table:
setDT(df)[rowid(rleid(drug)) == 1]
# ID date drug
# 1: 1 01/01/2020 A
# 2: 1 09/01/2020 B
# 3: 1 15/01/2020 A
# 4: 2 01/02/2020 C
# 5: 2 13/02/2020 D
# 6: 2 17/02/2020 C
# 7: 2 19/03/2020 E
If the runs of 'drug' should be considered within each 'ID' we need...
df[rowid(rleid(ID, drug)) == 1]
...to handle cases like:
ID date drug
1: 1 01/01/2020 A
2: 1 07/01/2020 A
3: 1 09/01/2020 B
4: 1 15/01/2020 B # This 'B' belongs to 2nd run in ID 1
5: 2 01/02/2020 B # This 'B' belongs to 1st run in ID 2
6: 2 13/02/2020 B
7: 2 17/02/2020 B
8: 2 18/03/2020 E
9: 2 19/03/2020 E
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With