Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Compare consecutive rows in data.table and replace row values

Tags:

r

data.table

I have a data.table in R that contains multiple status values for each user collected at different time points. I want to compare the the status values at consecutive time points and update the rows with a flag whenever the status changes. Please see below for an example

DT_A <- data.table(sid=c(1,1,2,2,2,3,3), date=as.Date(c("2014-06-22","2014-06-23","2014-06-22","2014-06-23", "2014-06-24","2014-06-22","2014-06-23")), Status1 = c("A","B","A","A","B","A","A"), Status2 = c("C","C","C","C","D","D","E"))
DT_A_Final <- data.table(sid=c(1,1,2,2,2,3,3), date=as.Date(c("2014-06-22","2014-06-23","2014-06-22","2014-06-23", "2014-06-24","2014-06-22","2014-06-23")), Status1 = c("0","1","0","0","1","0","0"), Status2 = c("0","0","0","0","1","0","1"))

The original data table DT_A is

    sid date    Status1 Status2
1   1   2014-06-22  A   C
2   1   2014-06-23  B   C
3   2   2014-06-22  A   C
4   2   2014-06-23  A   C
5   2   2014-06-24  B   D
6   3   2014-06-22  A   D
7   3   2014-06-23  A   E

The final required data table is DT_A_final

    sid date    Status1 Status2
1   1   2014-06-22  0   0
2   1   2014-06-23  1   0
3   2   2014-06-22  0   0
4   2   2014-06-23  0   0
5   2   2014-06-24  1   1
6   3   2014-06-22  0   0
7   3   2014-06-23  0   1

Please help how I can I achieve this?

like image 615
user3750170 Avatar asked Jun 24 '14 19:06

user3750170


People also ask

How do I compare two consecutive rows in R?

diff() method in base R is used to find the difference among all the pairs of consecutive rows in the R dataframe.

How to take difference of rows in R?

The data frame indexing methods can be used to calculate the difference of rows by group in R. The 'by' attribute is to specify the column to group the data by. All the rows are retained, while a new column is added in the set of columns, using the column to take to compute the difference of rows by the group.


2 Answers

Here is an option:

DT_A[, 
  c("S1Change", "S2Change") := 
    lapply(.SD, function(x) c(0, head(x, -1L) != tail(x, -1L))),
  .SDcols=c("Status1", "Status2"),   # .SD contains just these columns
  by=sid
]

Here, we create two new columns, which we populate by lapply over .SD (defined to contain just Status1 and Status2). The function compares all but the first value of a column to all but the last of the same column. This will return TRUE any time there is change in a column. We add 0 at the beginning since the first value is never a change; this also coerces the result to a numeric vector (thanks eddi).

Then, we just by by sid, and voila:

   sid       date Status1 Status2 S1Change S2Change
1:   1 2014-06-22       A       C        0        0
2:   1 2014-06-23       B       C        1        0
3:   2 2014-06-22       A       C        0        0
4:   2 2014-06-23       A       C        0        0
5:   2 2014-06-24       B       D        1        1
6:   3 2014-06-22       A       D        0        0
7:   3 2014-06-23       A       E        0        1

You can easily subset this to drop the original status columns if you want. It isn't possible to re-use them because the data type of the result is different than the original (numeric vs. character).

like image 89
BrodieG Avatar answered Oct 06 '22 21:10

BrodieG


A dplyr approach would also work here. Start by creating a function to compare all elements in a vector to the first element, and then apply this to all the "Status" variables:

library(dplyr)
library(magrittr)

equal_first <- function(x) {
  x %>% equals(x[1]) %>% not %>% as.numeric
}

DT_A %>%
  group_by(sid) %>%
  mutate_each(funs(equal_first),starts_with("Status"))
  sid       date Status1 Status2
1   1 2014-06-22       0       0
2   1 2014-06-23       1       0
3   2 2014-06-22       0       0
4   2 2014-06-23       0       0
5   2 2014-06-24       1       1
6   3 2014-06-22       0       0
7   3 2014-06-23       0       1

If you have more than one status change per user, you want to compare to the previous value, not the first:

equal_prev <- function(x) {
  x %>% equals(lag(x, default = x[1])) %>% not %>% as.numeric
}

DT_A %>%
  group_by(sid) %>%
  mutate_each(funs(equal_prev),starts_with("Status"))
like image 36
AndrewMacDonald Avatar answered Oct 06 '22 20:10

AndrewMacDonald