Compare consecutive rows in data.table and replace row values

I have a data.table in R that contains multiple status values for each user, collected at different time points. I want to compare the status values at consecutive time points and flag each row where the status changes from the previous row for the same user. Please see the example below.

library(data.table)

# Input: one row per user (sid) and date
DT_A <- data.table(
  sid     = c(1,1,2,2,2,3,3),
  date    = as.Date(c("2014-06-22","2014-06-23","2014-06-22","2014-06-23",
                      "2014-06-24","2014-06-22","2014-06-23")),
  Status1 = c("A","B","A","A","B","A","A"),
  Status2 = c("C","C","C","C","D","D","E"))

# Desired output: "1" where the status differs from the previous row of the same sid
DT_A_Final <- data.table(
  sid     = c(1,1,2,2,2,3,3),
  date    = as.Date(c("2014-06-22","2014-06-23","2014-06-22","2014-06-23",
                      "2014-06-24","2014-06-22","2014-06-23")),
  Status1 = c("0","1","0","0","1","0","0"),
  Status2 = c("0","0","0","0","1","0","1"))

The original data table DT_A is

   sid       date Status1 Status2
1    1 2014-06-22       A       C
2    1 2014-06-23       B       C
3    2 2014-06-22       A       C
4    2 2014-06-23       A       C
5    2 2014-06-24       B       D
6    3 2014-06-22       A       D
7    3 2014-06-23       A       E

The required final data table is DT_A_Final

   sid       date Status1 Status2
1    1 2014-06-22       0       0
2    1 2014-06-23       1       0
3    2 2014-06-22       0       0
4    2 2014-06-23       0       0
5    2 2014-06-24       1       1
6    3 2014-06-22       0       0
7    3 2014-06-23       0       1

How can I achieve this?

asked Jun 24 '14 19:06

user3750170

2 Answers

Here is an option:

DT_A[, 
  c("S1Change", "S2Change") := 
    lapply(.SD, function(x) c(0, head(x, -1L) != tail(x, -1L))),
  .SDcols=c("Status1", "Status2"),   # .SD contains just these columns
  by=sid
]

Here, we create two new columns, which we populate by calling lapply over .SD (restricted by .SDcols to just Status1 and Status2). The function compares all but the last value of a column (head(x, -1L)) to all but the first value of the same column (tail(x, -1L)), so each value is compared with the one that follows it. This returns TRUE wherever the column changes. We add 0 at the beginning since the first value is never a change; this also coerces the result to a numeric vector (thanks eddi).
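To see what the comparison does on its own, here is a minimal base R sketch on a single made-up vector (the vector x and the names changed and flags are illustrative, not part of the answer above):

```r
# A hypothetical status vector for one user
x <- c("A", "A", "B", "B", "A")

head(x, -1L)  # all but the last value:  "A" "A" "B" "B"
tail(x, -1L)  # all but the first value: "A" "B" "B" "A"

# Element-wise comparison of each value with the one that follows it
changed <- head(x, -1L) != tail(x, -1L)

# Prepend 0 for the first row; c() coerces the logicals to numeric
flags <- c(0, changed)
flags
```

For a group with a single row, head(x, -1L) and tail(x, -1L) are both empty, so the result is just the leading 0.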

Finally, we group by sid so that the comparison never crosses from one user to the next, and voila:

   sid       date Status1 Status2 S1Change S2Change
1:   1 2014-06-22       A       C        0        0
2:   1 2014-06-23       B       C        1        0
3:   2 2014-06-22       A       C        0        0
4:   2 2014-06-23       A       C        0        0
5:   2 2014-06-24       B       D        1        1
6:   3 2014-06-22       A       D        0        0
7:   3 2014-06-23       A       E        0        1

You can easily subset this to drop the original status columns if you want. The grouped assignment cannot write directly into them because the data type of the result differs from the original (numeric vs. character).
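If the flags should end up under the original column names, as in the question's desired output, one possible sketch is to compute them into temporary columns, drop the originals, and rename. This variant uses data.table's shift() rather than the head/tail comparison above, and the temporary column names are arbitrary; treat it as an illustration under those assumptions, not as the answer's own method.

```r
library(data.table)

DT <- data.table(
  sid     = c(1, 1, 2, 2, 2, 3, 3),
  Status1 = c("A", "B", "A", "A", "B", "A", "A"),
  Status2 = c("C", "C", "C", "C", "D", "D", "E")
)

status_cols <- c("Status1", "Status2")
flag_cols   <- c("S1Change", "S2Change")  # temporary names, chosen for illustration

# Compare each value with the previous one within the same sid;
# fill = x[1L] makes the first row of each group compare equal to itself
DT[, (flag_cols) := lapply(.SD, function(x) as.integer(x != shift(x, fill = x[1L]))),
   by = sid, .SDcols = status_cols]

# Drop the character columns, then give the flags the original names
DT[, (status_cols) := NULL]
setnames(DT, flag_cols, status_cols)
DT
```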

answered Oct 06 '22 21:10

BrodieG

A dplyr approach would also work here. Start by creating a function that flags every element of a vector that differs from the first element, and then apply it to all the "Status" variables:

library(dplyr)
library(magrittr)

equal_first <- function(x) {
  x %>% equals(x[1]) %>% not %>% as.numeric
}

DT_A %>%
  group_by(sid) %>%
  mutate_each(funs(equal_first), starts_with("Status"))

  sid       date Status1 Status2
1   1 2014-06-22       0       0
2   1 2014-06-23       1       0
3   2 2014-06-22       0       0
4   2 2014-06-23       0       0
5   2 2014-06-24       1       1
6   3 2014-06-22       0       0
7   3 2014-06-23       0       1

If a user can have more than one status change, compare each value to the previous value rather than to the first:

equal_prev <- function(x) {
  x %>% equals(lag(x, default = x[1])) %>% not %>% as.numeric
}

DT_A %>%
  group_by(sid) %>%
  mutate_each(funs(equal_prev),starts_with("Status"))
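
The difference between the two comparisons only shows up when a status changes and then changes back. A small base R sketch on a made-up vector (the names vs_first, prev, and vs_prev are illustrative, and this avoids dplyr and magrittr on purpose) makes that visible:

```r
# Hypothetical statuses for one user: changes to "B", then back to "A"
x <- c("A", "B", "A", "A")

# Compare every value with the first one
vs_first <- as.numeric(x != x[1])

# Compare every value with the one before it;
# the first value is paired with itself so it is never flagged
prev    <- c(x[1], head(x, -1L))
vs_prev <- as.numeric(x != prev)

vs_first  # the return to "A" in position 3 is missed
vs_prev   # position 3 is flagged as a change
```

On the question's sample data both versions give the same flags, since no user there returns to an earlier status.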

answered Oct 06 '22 20:10

AndrewMacDonald
