Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Dplyr or data.table consolidate consecutive rows within grouped data based on value in another column

I have data that looks like this:

ID  CLASS   START   END
100 GA  3-Jan-15    1-Feb-15
100 G   1-Feb-15    22-Feb-15
100 GA  28-Feb-15   17-Mar-15
100 G   1-Apr-15    8-Apr-15
100 G   10-Apr-15   18-Apr-15
200 FA  3-Jan-14    1-Feb-14
200 FA  1-Feb-14    22-Feb-14
200 G   28-Feb-14   15-Mar-14
200 F   1-Apr-14    20-Apr-14

Here is the data:

df <- structure(list(ID = c(100L, 100L, 100L, 100L, 100L, 200L, 200L, 
200L, 200L), CLASS = structure(c(4L, 3L, 4L, 3L, 3L, 2L, 2L, 
3L, 1L), .Label = c("F", "FA", "G", "GA"), class = "factor"), 
START = structure(c(9L, 4L, 7L, 2L, 5L, 8L, 3L, 6L, 1L), .Label = c("1-Apr-14", 
"1-Apr-15", "1-Feb-14", "1-Feb-15", "10-Apr-15", "28-Feb-14", 
"28-Feb-15", "3-Jan-14", "3-Jan-15"), class = "factor"), 
END = structure(c(2L, 8L, 4L, 9L, 5L, 1L, 7L, 3L, 6L), .Label = c("1-Feb-14", 
"1-Feb-15", "15-Mar-14", "17-Mar-15", "18-Apr-15", "20-Apr-14", 
"22-Feb-14", "22-Feb-15", "8-Apr-15"), class = "factor")), .Names = c("ID", 
"CLASS", "START", "END"), class = "data.frame", row.names = c(NA, 
-9L))

I would like to group the data by the ID column and then consolidate any consecutive occurrences of the same value in the CLASS column (sorted by the START date), while selecting the minimum start date and the maximum end date. So for ID number 100, there is only one instance where the "G" class is consecutive, so I would like to consolidate those two rows into a single row with the min(START) and max(END) dates. This is a simple example but in the real data sometimes there are several consecutive rows that need to be consolidated.

I have tried group_by followed by using some kind of ranking but this doesn't seem to do the trick. Any suggestions on how to solve this? Also this is the first time I am posting on SO, so I hope this question makes sense.

Result should look like this:

ID  CLASS   START   END
100 GA  3-Jan-15    1-Feb-15
100 G   1-Feb-15    22-Feb-15
100 GA  28-Feb-15   17-Mar-15
100 G   1-Apr-15    18-Apr-15
200 FA  3-Jan-14    22-Feb-14
200 G   28-Feb-14   15-Mar-14
200 F   1-Apr-14    20-Apr-14
like image 244
Kartik Avatar asked Apr 28 '16 19:04

Kartik


1 Answers

Here's an option, using data.table::rleid to make an id for runs of the same ID and CLASS:

# make START and END Date class for easier manipulation
df <- df %>% mutate(START = as.Date(START, '%d-%b-%y'), 
                    END = as.Date(END, '%d-%b-%y'))
# More concise alternative: 
# df <- df %>% mutate_each(funs(as.Date(., '%d-%b-%y')), START, END)

# group and make rleid as mentioned above
df %>% group_by(ID, CLASS, rleid = data.table::rleid(ID, CLASS)) %>% 
    # collapse with summarise, replacing START and END with their min and max for each group
    summarise(START = min(START), END = max(END)) %>% 
    # clean up arrangement and get rid of added rleid column
    ungroup() %>% arrange(rleid) %>% select(-rleid)

# Source: local data frame [7 x 4]
# 
#      ID  CLASS      START        END
#   (int) (fctr)     (date)     (date)
# 1   100     GA 2015-01-03 2015-02-01
# 2   100      G 2015-02-01 2015-02-22
# 3   100     GA 2015-02-28 2015-03-17
# 4   100      G 2015-04-01 2015-04-18
# 5   200     FA 2014-01-03 2014-02-22
# 6   200      G 2014-02-28 2014-03-15
# 7   200      F 2014-04-01 2014-04-20

Here's the pure data.table analogue:

library(data.table)
setDT(df)
datecols = c("START","END")
df[, (datecols) := lapply(.SD, as.IDate, format = '%d-%b-%y'), .SDcols = datecols]

df[, .(START = START[1L], END = END[.N]), by=.(ID, CLASS, r = rleid(ID, CLASS))][, r := NULL][]
like image 186
alistaire Avatar answered Sep 23 '22 00:09

alistaire