I have a data.table and I am trying to do something akin to data[ !is.na(variable) ]. However, for groups that are entirely missing, I'd like to keep just the first row of that group. So, I am trying to subset using by. I have done some research online and have a solution, but I think it is inefficient.
I've provided an example below showing what I am hoping to achieve, and I wonder if this can be done without creating the two extra columns.
d_sample = data.table( ID = c(1, 1, 2, 2, 3, 3),
Time = c(10, 15, 100, 110, 200, 220),
Event = c(NA, NA, NA, 1, 1, NA))
d_sample[ !is.na(Event), isValidOutcomeRow := T, by = ID]
d_sample[ , isValidOutcomePatient := any(isValidOutcomeRow), by = ID]
d_sample[ is.na(isValidOutcomePatient), isValidOutcomeRow := c(T, rep(NA, .N - 1)), by = ID]
d_sample[ isValidOutcomeRow == T ]
EDIT: Here are some speed comparisons of thelatemail's and Frank's solutions on a larger dataset with 60K rows.
d_sample = data.table( ID = sort(rep(seq(1,30000), 2)),
Time = rep(c(10, 15, 100, 110, 200, 220), 10000),
Event = rep(c(NA, NA, NA, 1, 1, NA), 10000) )
thelatemail's solution gets a runtime of 20.65 on my computer.
system.time(d_sample[, if(all(is.na(Event))) .SD[1] else .SD[!is.na(Event)][1], by=ID])
Frank's first solution gets a runtime of 0
system.time( unique( d_sample[order(is.na(Event))], by="ID" ) )
Frank's second solution gets a runtime of 0.05
system.time( d_sample[order(is.na(Event)), .SD[1L], by=ID] )
This seems to work:
unique( d_sample[order(is.na(Event))], by="ID" )
   ID Time Event
1:  2  110     1
2:  3  200     1
3:  1   10    NA
Alternately, d_sample[order(is.na(Event)), .SD[1L], by=ID].
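To see why the ordering trick works, it may help to look at the intermediate step. This is only a sketch on the question's six-row sample; the names sorted and first_per_id are mine. Note that both variants keep exactly one row per ID, so they match data[ !is.na(variable) ] only when no ID has more than one non-missing Event, which is the case in this sample.

```r
library(data.table)

d_sample <- data.table(ID    = c(1, 1, 2, 2, 3, 3),
                       Time  = c(10, 15, 100, 110, 200, 220),
                       Event = c(NA, NA, NA, 1, 1, NA))

# is.na(Event) is FALSE for observed events and TRUE for missing ones.
# order() sorts FALSE before TRUE and keeps ties in their original order,
# so every non-missing row moves ahead of every missing row.
sorted <- d_sample[order(is.na(Event))]
sorted$ID   # 2 3 1 1 2 3

# unique(..., by = "ID") keeps the first row seen for each ID:
# a non-missing row if the ID has one, otherwise its first (all-NA) row.
first_per_id <- unique(sorted, by = "ID")
first_per_id
```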
Extending the OP's example, I also find similar timings for the two approaches:
n = 12e4 # must be a multiple of 6
set.seed(1)
d_sample = data.table( ID = sort(rep(seq(1,n/2), 2)),
Time = rep(c(10, 15, 100, 110, 200, 220), n/6),
Event = rep(c(NA, NA, NA, 1, 1, NA), n/6) )
system.time(rf <- unique( d_sample[order(is.na(Event))], by="ID" ))
# 1.17
system.time(rf2 <- d_sample[order(is.na(Event)), .SD[1L], by=ID] )
# 1.24
system.time(rt <- d_sample[, if(all(is.na(Event))) .SD[1] else .SD[!is.na(Event)], by=ID])
# 10.42
system.time(rt2 <-
d_sample[ d_sample[, { w = which(is.na(Event)); .I[ if (length(w) == .N) 1L else -w ] }, by=ID]$V1 ]
)
# 0.13
# verify
identical(rf,rf2) # TRUE
identical(rf,rt) # FALSE
fsetequal(rf,rt) # TRUE
identical(rt,rt2) # TRUE
The variation on @thelatemail's solution, rt2, is the fastest by a wide margin.
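One caveat with the .I[-w] indexing: it works on the data above because every ID has at least one missing Event. If an ID has no missing values, w is empty, and .I[-integer(0)] selects nothing, so that ID would be dropped. A minimal sketch of a logical-index variant that seems to avoid this, assuming such IDs can occur (ID 4 below is made up for illustration, and I have not timed this version):

```r
library(data.table)

# The question's sample plus a hypothetical ID 4 with no missing Event
d <- data.table(ID    = c(1, 1, 2, 2, 3, 3, 4, 4),
                Time  = c(10, 15, 100, 110, 200, 220, 300, 330),
                Event = c(NA, NA, NA, 1, 1, NA, 1, 1))

# Negative-index version: for ID 4, w is integer(0), so .I[-w] is empty
keep_neg <- d[, {
  w <- which(is.na(Event))
  .I[if (length(w) == .N) 1L else -w]
}, by = ID]$V1

# Logical-index variant: first row if all Event are NA, else every non-NA row
keep <- d[, {
  na <- is.na(Event)
  .I[if (all(na)) 1L else !na]
}, by = ID]$V1

keep_neg   # 1 4 5        (ID 4 is lost)
keep       # 1 4 5 7 8
d[keep]
```

On data where every ID has at least one NA, the two versions should select the same rows.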