I have a data.table in R that contains multiple status values for each user collected at different time points. I want to compare the the status values at consecutive time points and update the rows with a flag whenever the status changes. Please see below for an example
DT_A <- data.table(sid=c(1,1,2,2,2,3,3), date=as.Date(c("2014-06-22","2014-06-23","2014-06-22","2014-06-23", "2014-06-24","2014-06-22","2014-06-23")), Status1 = c("A","B","A","A","B","A","A"), Status2 = c("C","C","C","C","D","D","E"))
DT_A_Final <- data.table(sid=c(1,1,2,2,2,3,3), date=as.Date(c("2014-06-22","2014-06-23","2014-06-22","2014-06-23", "2014-06-24","2014-06-22","2014-06-23")), Status1 = c("0","1","0","0","1","0","0"), Status2 = c("0","0","0","0","1","0","1"))
The original data table DT_A is
sid date Status1 Status2
1 1 2014-06-22 A C
2 1 2014-06-23 B C
3 2 2014-06-22 A C
4 2 2014-06-23 A C
5 2 2014-06-24 B D
6 3 2014-06-22 A D
7 3 2014-06-23 A E
The final required data table is DT_A_final
sid date Status1 Status2
1 1 2014-06-22 0 0
2 1 2014-06-23 1 0
3 2 2014-06-22 0 0
4 2 2014-06-23 0 0
5 2 2014-06-24 1 1
6 3 2014-06-22 0 0
7 3 2014-06-23 0 1
Please help how I can I achieve this?
diff() method in base R is used to find the difference among all the pairs of consecutive rows in the R dataframe.
The data frame indexing methods can be used to calculate the difference of rows by group in R. The 'by' attribute is to specify the column to group the data by. All the rows are retained, while a new column is added in the set of columns, using the column to take to compute the difference of rows by the group.
Here is an option:
DT_A[,
c("S1Change", "S2Change") :=
lapply(.SD, function(x) c(0, head(x, -1L) != tail(x, -1L))),
.SDcols=c("Status1", "Status2"), # .SD contains just these columns
by=sid
]
Here, we create two new columns, which we populate by lapply
over .SD
(defined to contain just Status1
and Status2
). The function compares all but the first value of a column to all but the last of the same column. This will return TRUE any time there is change in a column. We add 0 at the beginning since the first value is never a change; this also coerces the result to a numeric vector (thanks eddi).
Then, we just by
by sid
, and voila:
sid date Status1 Status2 S1Change S2Change
1: 1 2014-06-22 A C 0 0
2: 1 2014-06-23 B C 1 0
3: 2 2014-06-22 A C 0 0
4: 2 2014-06-23 A C 0 0
5: 2 2014-06-24 B D 1 1
6: 3 2014-06-22 A D 0 0
7: 3 2014-06-23 A E 0 1
You can easily subset this to drop the original status columns if you want. It isn't possible to re-use them because the data type of the result is different than the original (numeric vs. character).
A dplyr
approach would also work here. Start by creating a function to compare all elements in a vector to the first element, and then apply this to all the "Status" variables:
library(dplyr)
library(magrittr)
equal_first <- function(x) {
x %>% equals(x[1]) %>% not %>% as.numeric
}
DT_A %>%
group_by(sid) %>%
mutate_each(funs(equal_first),starts_with("Status"))
sid date Status1 Status2
1 1 2014-06-22 0 0
2 1 2014-06-23 1 0
3 2 2014-06-22 0 0
4 2 2014-06-23 0 0
5 2 2014-06-24 1 1
6 3 2014-06-22 0 0
7 3 2014-06-23 0 1
If you have more than one status change per user, you want to compare to the previous value, not the first:
equal_prev <- function(x) {
x %>% equals(lag(x, default = x[1])) %>% not %>% as.numeric
}
DT_A %>%
group_by(sid) %>%
mutate_each(funs(equal_prev),starts_with("Status"))
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With