Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Select first positive match per ID per date in R

Tags:

r

I have a dataframe with different observations over time. As soon as an ID has a positive value for "Match", the rows with the ID in the dates that follow has to be removed. This is an example dataframe:

      Date  ID  Match
2018-06-06  5    1
2018-06-06  6    0
2018-06-07  5    1
2018-06-07  6    0
2018-06-07  7    1
2018-06-08  5    0
2018-06-08  6    1
2018-06-08  7    1
2018-06-08  8    1

Desired output:

      Date  ID  Match
2018-06-06  5    1
2018-06-06  6    0
2018-06-07  6    0
2018-06-07  7    1
2018-06-08  6    1
2018-06-08  8    1

In other words, because ID=5 has a positive match on 2018-06-06, the rows with ID=5 are removed for the following days BUT the row with the first positive match for this ID is kept.

Reproducable example:

Date <- c("2018-06-06","2018-06-06","2018-06-07","2018-06-07","2018-06-07","2018-06-08","2018-06-08","2018-06-08","2018-06-08")
ID <- c(5,6,5,6,7,5,6,7,8)
Match <- c(1,0,1,0,1,0,1,1,1)
df <- data.frame(Date,ID,Match)

Thank you in advance

like image 760
olive Avatar asked Dec 10 '22 05:12

olive


1 Answers

One way:

library(data.table)
setDT(df)
df[, Match := as.integer(as.character(Match))] # fix bad format

df[, .SD[shift(cumsum(Match), fill=0) == 0], by=ID]

   ID       Date Match
1:  5 2018-06-06     1
2:  6 2018-06-06     0
3:  6 2018-06-07     0
4:  6 2018-06-08     1
5:  7 2018-06-07     1
6:  8 2018-06-08     1

We want to drop rows after the first Match == 1.

cumsum takes the cumulative sum of Match. It is zero until the first Match == 1. We want to keep the latter row and so check cumsum on the preceding row with shift.

like image 80
Frank Avatar answered Jan 07 '23 09:01

Frank