I have data that looks like this:
ID CLASS START END
100 GA 3-Jan-15 1-Feb-15
100 G 1-Feb-15 22-Feb-15
100 GA 28-Feb-15 17-Mar-15
100 G 1-Apr-15 8-Apr-15
100 G 10-Apr-15 18-Apr-15
200 FA 3-Jan-14 1-Feb-14
200 FA 1-Feb-14 22-Feb-14
200 G 28-Feb-14 15-Mar-14
200 F 1-Apr-14 20-Apr-14
Here is the data:
df <- structure(list(ID = c(100L, 100L, 100L, 100L, 100L, 200L, 200L,
200L, 200L), CLASS = structure(c(4L, 3L, 4L, 3L, 3L, 2L, 2L,
3L, 1L), .Label = c("F", "FA", "G", "GA"), class = "factor"),
START = structure(c(9L, 4L, 7L, 2L, 5L, 8L, 3L, 6L, 1L), .Label = c("1-Apr-14",
"1-Apr-15", "1-Feb-14", "1-Feb-15", "10-Apr-15", "28-Feb-14",
"28-Feb-15", "3-Jan-14", "3-Jan-15"), class = "factor"),
END = structure(c(2L, 8L, 4L, 9L, 5L, 1L, 7L, 3L, 6L), .Label = c("1-Feb-14",
"1-Feb-15", "15-Mar-14", "17-Mar-15", "18-Apr-15", "20-Apr-14",
"22-Feb-14", "22-Feb-15", "8-Apr-15"), class = "factor")), .Names = c("ID",
"CLASS", "START", "END"), class = "data.frame", row.names = c(NA,
-9L))
I would like to group the data by the ID column and then consolidate any consecutive occurrences of the same value in the CLASS column (sorted by the START date), while selecting the minimum start date and the maximum end date. So for ID number 100, there is only one instance where the "G" class is consecutive, so I would like to consolidate those two rows into a single row with the min(START) and max(END) dates. This is a simple example but in the real data sometimes there are several consecutive rows that need to be consolidated.
I have tried group_by followed by using some kind of ranking but this doesn't seem to do the trick. Any suggestions on how to solve this? Also this is the first time I am posting on SO, so I hope this question makes sense.
Result should look like this:
ID CLASS START END
100 GA 3-Jan-15 1-Feb-15
100 G 1-Feb-15 22-Feb-15
100 GA 28-Feb-15 17-Mar-15
100 G 1-Apr-15 18-Apr-15
200 FA 3-Jan-14 22-Feb-14
200 G 28-Feb-14 15-Mar-14
200 F 1-Apr-14 20-Apr-14
Here's an option, using data.table::rleid
to make an id for runs of the same ID
and CLASS
:
# make START and END Date class for easier manipulation
df <- df %>% mutate(START = as.Date(START, '%d-%b-%y'),
END = as.Date(END, '%d-%b-%y'))
# More concise alternative:
# df <- df %>% mutate_each(funs(as.Date(., '%d-%b-%y')), START, END)
# group and make rleid as mentioned above
df %>% group_by(ID, CLASS, rleid = data.table::rleid(ID, CLASS)) %>%
# collapse with summarise, replacing START and END with their min and max for each group
summarise(START = min(START), END = max(END)) %>%
# clean up arrangement and get rid of added rleid column
ungroup() %>% arrange(rleid) %>% select(-rleid)
# Source: local data frame [7 x 4]
#
# ID CLASS START END
# (int) (fctr) (date) (date)
# 1 100 GA 2015-01-03 2015-02-01
# 2 100 G 2015-02-01 2015-02-22
# 3 100 GA 2015-02-28 2015-03-17
# 4 100 G 2015-04-01 2015-04-18
# 5 200 FA 2014-01-03 2014-02-22
# 6 200 G 2014-02-28 2014-03-15
# 7 200 F 2014-04-01 2014-04-20
Here's the pure data.table analogue:
library(data.table)
setDT(df)
datecols = c("START","END")
df[, (datecols) := lapply(.SD, as.IDate, format = '%d-%b-%y'), .SDcols = datecols]
df[, .(START = START[1L], END = END[.N]), by=.(ID, CLASS, r = rleid(ID, CLASS))][, r := NULL][]
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With