Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Select one row from each group in a large data.table based on a condition [duplicate]

Tags:

r

data.table

I have a table where the key is repeated a number of times, and one to select just one row for each key, using the largest value of another column.

This example demonstrates the solution I have at the moment:

N = 10
k = 2
DT = data.table(X = rep(1:N, each = k), Y = rnorm(k*N))
     X           Y
 1:  1 -1.37925206
 2:  1 -0.53837461
 3:  2  0.26516340
 4:  2 -0.04643483
 5:  3  0.40331424
 6:  3  0.28667275
 7:  4 -0.30342327
 8:  4 -2.13143267
 9:  5  2.11178673
10:  5 -0.98047230
11:  6 -0.27230783
12:  6 -0.79540934
13:  7  1.54264549
14:  7  0.40079650
15:  8 -0.98474297
16:  8  0.73179201
17:  9 -0.34590491
18:  9 -0.55897393
19: 10  0.97523187
20: 10  1.16924293
> DT[, .SD[Y == max(Y)], by = X]
     X          Y
 1:  1 -0.5383746
 2:  2  0.2651634
 3:  3  0.4033142
 4:  4 -0.3034233
 5:  5  2.1117867
 6:  6 -0.2723078
 7:  7  1.5426455
 8:  8  0.7317920
 9:  9 -0.3459049
10: 10  1.1692429

The problem is that for larger data.tables this take a very long time:

N = 10000
k = 25
DT = data.table(X = rep(1:N, each = k), Y = rnorm(k*N))
system.time(DT[, .SD[Y == max(Y)], by = X])
   user  system elapsed 
   9.69    0.00    9.69 

My actual table about 100 million rows...

Can anyone suggest a more efficient solution?


Edit - importance of set key

The solution proposed works well, but you must use setkey, or have the DT ordered for it to work:

See Example without "each" in rep:

N = 10
k = 2
DT = data.table(X = rep(1:N, k), Y = rnorm(k*N))
DT[DT[, Y == max(Y), by = X]$V1,]
     X           Y
 1:  1  1.26925708
 2:  4 -0.66625732
 3:  5  0.41498548
 4:  8  0.03531185
 5:  9  0.30608380
 6:  1  0.50308578
 7:  4  0.19848227
 8:  6  0.86458423
 9:  8  0.69825500
10: 10 -0.38160503
like image 586
Corvus Avatar asked Jan 09 '15 14:01

Corvus


1 Answers

This would be faster compared to .SD

 system.time({setkey(DT, X)
    DT[DT[,Y==max(Y), by=X]$V1,]})
  # user  system elapsed 
  #0.016   0.000   0.016 

Or

system.time(DT[DT[, .I[Y==max(Y)], by=X]$V1])
#  user  system elapsed 
# 0.023   0.000   0.023 

If there are only two columns,

system.time(DT[,list(Y=max(Y)), by=X])
#   user  system elapsed 
#  0.006   0.000   0.007 

Compared to,

system.time(DT[, .SD[Y == max(Y)], by = X] )
#  user  system elapsed 
# 2.946   0.006   2.962 

Based on comments from @Khashaa, @AnandaMahto, the CRAN version (1.9.4) gives a different result for the .SD method compared to devel version (1.9.5) (which I used). You could get the same result for "CRAN" version (from @Arun's comments) by setting the options

 options(datatable.auto.index=FALSE)

NOTE: In case of "ties", the solutions described here will return multiple rows for each group (as mentioned by @docendo discimus). My solutions are based on the "code" posted by the OP.

If there are "ties", then you could use unique with by option (in case the number of columns are > 2)

 setkey(DT,X)
 unique(DT[DT[,Y==max(Y), by=X]$V1,], by=c("X", "Y"))

microbenchmarks

library(microbenchmark)
f1 <- function(){setkey(DT,X)[DT[, Y==max(Y), by=X]$V1,]}
f2 <- function(){DT[DT[, .I[Y==max(Y)], by=X]$V1]}
f3 <- function(){DT[, list(Y=max(Y)), by=X]}
f4 <- function(){DT[, .SD[Y==max(Y)], by=X]}
microbenchmark(f1(), f2(), f3(), f4(), unit='relative', times=20L)
#Unit: relative
# expr        min         lq       mean     median         uq        max neval
# f1()   2.794435   2.733706   3.024097   2.756398   2.832654   6.697893    20
# f2()   4.302534   4.291715   4.535051   4.271834   4.342437   8.114811    20
# f3()   1.000000   1.000000   1.000000   1.000000   1.000000   1.000000    20
# f4() 533.119480 522.069189 504.739719 507.494095 493.641512 466.862691    20
# cld
#  a 
#  a 
#  a 
#  b

data

N = 10000
k = 25
set.seed(25)
DT = data.table(X = rep(1:N, each = k), Y = rnorm(k*N))
like image 135
akrun Avatar answered Oct 19 '22 03:10

akrun