Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Copy rows inside data.table based on condition

Tags:

r

data.table

My sample data.table looks as follows

Starting Data

library(data.table)
x <- data.table(id = as.character(c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3)),
            time = as.character(c(1,2,3,4,5,1,2,3,4,5,1,2,3,4,5)),
            status = c("P", "R", "R", "R", "R", "P", "P", "P", "R", "R", "P", "P", "R", "R", "R"),
            balance = c(100, 90, 80, 70, 60, 320, 300, 250, 200, 100, 40, 34, 31, 29, 10),
            employment = c("Y", "Y", "Y", "N", "N", "N", "N", "N", "N", "Y", "N", "Y", "Y", "Y", "Y")
)

The objective is to copy the information in the columns balance and employment if the status migrates from "P" to "R" by "id". I.e. I would like to use the data in the last period in which a id is "P" and overwrite the existing information for all next periods in which the id is "R".

Hence, the goal is to obtain this data.table

Goal

Y <- data.table(id = as.character(c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3)),
            time = as.character(c(1,2,3,4,5,1,2,3,4,5,1,2,3,4,5)),
            status =     c("P", "R", "R", "R", "R", "P", "P", "P", "R", "R", "P", "P", "R", "R", "R"),
            balance = c(100, 100, 100, 100, 100, 320, 300, 250, 250, 250, 40, 34, 34, 34, 34),
            employment = c("Y", "Y", "Y", "Y", "Y", "N", "N", "N", "N", "N", "N", "Y", "Y", "Y", "Y")
)

Note that the columns time and status itselft (and of course id) are not affected.

I have tried to use seq_len across id's, then set this column to zero if status is "R and search for the max value (by id) of this column to use as an indicator which row has to be copied. I'm sure there is a faster and better way to resolve this. Maybe even an one-liner.

If anything is unclear please let me know

like image 994
Dave van Brecht Avatar asked Feb 18 '16 08:02

Dave van Brecht


People also ask

How do I copy a DataTable?

To enable copying and pasting DataTable data by the CTRL+C/CTRL+V keyboard shortcuts you should use the clipboard parameter. The parameter can have the following values: "block" (also set as true);


1 Answers

One way to do this is to set the desired columns to NAs when status == R and then carry the last observation forward (LOCF), since all the ids start with P, I don't think you really need to do this by id and hence improve performance. Here's a way

## Define column names you want to modify
cols <- c("balance", "employment")

## Assign `NA`s when satus == "R"
x[status == "R", (cols) := NA]

## Carry the last observation forward and update by reference
library(zoo) ## You could do this with base R too, just more writing
x[, (cols) := lapply(.SD, na.locf), .SDcols = cols]
like image 58
David Arenburg Avatar answered Sep 27 '22 21:09

David Arenburg