Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

data.table update last element in group based on condition

Tags:

r

data.table

I have a data.table with 3 columns: id, time and status. For each id, I want to find the record with the maximum time - then if for that record, the status is true, I want to set it to false if the time is > 7 (for example). I am doing it in the following manner.

x <- data.table(id=c(1,1,2,2),time=c(5,6,7,8),status=c(FALSE,TRUE,FALSE,TRUE))
setkey(x,id,time)
y <- x[,.SD[.N],by=id]
x[y,status:=status & time > 7]

I have a lot of data I am working with and would like to speed up this operation. Any suggestions would be appreciated!

like image 890
user2506086 Avatar asked Mar 03 '16 00:03

user2506086


Video Answer


3 Answers

x[x[,.N, by=id][,cumsum(N)], status := status * time <=7]

If i am not mistaken, this is no join as x[,.N, by=id][,cumsum(N)] returns the row-indices of the last elements per group.

Update: After seeing the speed comparison this one seems the winner and should be listed first

This was my initial attempt which turns out to be the slowest of all suggested solutions

x[,status := c(.SD[-.N, status], .SD[.N, status * time <=7]), by=id]
like image 189
Rentrop Avatar answered Oct 22 '22 06:10

Rentrop


One data.table approach is

x[ x[order(time), .I[.N], by=id]$V1 , status := ifelse(time > 7, FALSE, TRUE)]

> x
#   id time status
#1:  1    5  FALSE
#2:  1    6   TRUE
#3:  2    7  FALSE
#4:  2    8  FALSE

as x[order(time), .I[.N], by=id]$V1 gives us the row index of the maximum time for each group (id)

And, borrowing from @Floo0's answer we can simplify it slightly to

x[ x[order(time), .I[.N], by=id]$V1 , status := status * time <= 7]

Speed Comparison

A speed test of the various answers (and keeping the keys on the data)

set.seed(123)
x <- data.table(id=c(rep(seq(1:10000), each=10)),
                time=c(rep(seq(1:10000), 10)),
                status=c(sample(c(TRUE, FALSE), 10000*10, replace=T)))
setkey(x,id,time)
x1 <- copy(x); x2 <- copy(x); x3 <- copy(x); x4 <- copy(x); x5 <- copy(x); x6 <- copy(x)

library(microbenchmark)

microbenchmark(

    Symbolix = {x1[ x1[order(time), .I[.N], by=id]$V1 , status := status * time < 7 ] },

    Floo0_1 = {x2[,status := c(.SD[-.N, status], .SD[.N, status * time > 7]), by=id]},

    Floo0_2 = {x3[x3[,.N, by=id][,cumsum(N)], status := status * time > 7]},

    Original = { 
                y <- x4[,.SD[.N],by=id]
                x4[y,status:=status & time > 7]
               },

    Frank = {
             y <- x5[, .SD[.N, .(time, status)], by=id][time > 7 & status]
             x5[y, status := FALSE]
             },

    thelatemail = {x6[ x6[,.I==.I[which.max(time)], by=id]$V1 & time > 7, status := FALSE]}
)

Unit: milliseconds
        expr         min          lq        mean      median          uq         max neval cld
    Symbolix    5.419768    5.857477    6.514111    6.222118    6.936000   11.284580   100 a  
     Floo0_1 4550.314775 4710.679867 4787.086279 4776.794072 4850.334011 5097.136148   100   c
     Floo0_2    1.653419    1.792378    1.945203    1.881609    2.014325    4.096006   100 a  
    Original   10.052947   10.986294   12.541595   11.431182   12.391287   89.494783   100 a  
       Frank 4609.115061 4697.687642 4743.886186 4735.086113 4785.212543 4932.270602   100  b 
 thelatemail   10.300864   11.594972   12.421889   12.315852   12.984146   17.630736   100 a  
like image 20
SymbolixAU Avatar answered Oct 22 '22 07:10

SymbolixAU


Another attempt:

x[ x[,.I==.I[which.max(time)], by=id]$V1 & time > 7, status := FALSE]
x

#   id time status
#1:  1    5  FALSE
#2:  1    6   TRUE
#3:  2    7  FALSE
#4:  2    8  FALSE
like image 26
thelatemail Avatar answered Oct 22 '22 06:10

thelatemail