I have a table where the key is repeated a number of times, and one to select just one row for each key, using the largest value of another column.
This example demonstrates the solution I have at the moment:
N = 10
k = 2
DT = data.table(X = rep(1:N, each = k), Y = rnorm(k*N))
X Y
1: 1 -1.37925206
2: 1 -0.53837461
3: 2 0.26516340
4: 2 -0.04643483
5: 3 0.40331424
6: 3 0.28667275
7: 4 -0.30342327
8: 4 -2.13143267
9: 5 2.11178673
10: 5 -0.98047230
11: 6 -0.27230783
12: 6 -0.79540934
13: 7 1.54264549
14: 7 0.40079650
15: 8 -0.98474297
16: 8 0.73179201
17: 9 -0.34590491
18: 9 -0.55897393
19: 10 0.97523187
20: 10 1.16924293
> DT[, .SD[Y == max(Y)], by = X]
X Y
1: 1 -0.5383746
2: 2 0.2651634
3: 3 0.4033142
4: 4 -0.3034233
5: 5 2.1117867
6: 6 -0.2723078
7: 7 1.5426455
8: 8 0.7317920
9: 9 -0.3459049
10: 10 1.1692429
The problem is that for larger data.tables this take a very long time:
N = 10000
k = 25
DT = data.table(X = rep(1:N, each = k), Y = rnorm(k*N))
system.time(DT[, .SD[Y == max(Y)], by = X])
user system elapsed
9.69 0.00 9.69
My actual table about 100 million rows...
Can anyone suggest a more efficient solution?
Edit - importance of set key
The solution proposed works well, but you must use setkey, or have the DT ordered for it to work:
See Example without "each" in rep:
N = 10
k = 2
DT = data.table(X = rep(1:N, k), Y = rnorm(k*N))
DT[DT[, Y == max(Y), by = X]$V1,]
X Y
1: 1 1.26925708
2: 4 -0.66625732
3: 5 0.41498548
4: 8 0.03531185
5: 9 0.30608380
6: 1 0.50308578
7: 4 0.19848227
8: 6 0.86458423
9: 8 0.69825500
10: 10 -0.38160503
This would be faster compared to .SD
system.time({setkey(DT, X)
DT[DT[,Y==max(Y), by=X]$V1,]})
# user system elapsed
#0.016 0.000 0.016
Or
system.time(DT[DT[, .I[Y==max(Y)], by=X]$V1])
# user system elapsed
# 0.023 0.000 0.023
If there are only two columns,
system.time(DT[,list(Y=max(Y)), by=X])
# user system elapsed
# 0.006 0.000 0.007
Compared to,
system.time(DT[, .SD[Y == max(Y)], by = X] )
# user system elapsed
# 2.946 0.006 2.962
Based on comments from @Khashaa, @AnandaMahto, the CRAN version (1.9.4
) gives a different result for the .SD
method compared to devel version (1.9.5
) (which I used). You could get the same result for "CRAN" version (from @Arun's comments) by setting the options
options(datatable.auto.index=FALSE)
NOTE: In case of "ties", the solutions described here will return multiple rows for each group (as mentioned by @docendo discimus). My solutions are based on the "code" posted by the OP.
If there are "ties", then you could use unique
with by
option (in case the number of columns are >
2)
setkey(DT,X)
unique(DT[DT[,Y==max(Y), by=X]$V1,], by=c("X", "Y"))
library(microbenchmark)
f1 <- function(){setkey(DT,X)[DT[, Y==max(Y), by=X]$V1,]}
f2 <- function(){DT[DT[, .I[Y==max(Y)], by=X]$V1]}
f3 <- function(){DT[, list(Y=max(Y)), by=X]}
f4 <- function(){DT[, .SD[Y==max(Y)], by=X]}
microbenchmark(f1(), f2(), f3(), f4(), unit='relative', times=20L)
#Unit: relative
# expr min lq mean median uq max neval
# f1() 2.794435 2.733706 3.024097 2.756398 2.832654 6.697893 20
# f2() 4.302534 4.291715 4.535051 4.271834 4.342437 8.114811 20
# f3() 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 20
# f4() 533.119480 522.069189 504.739719 507.494095 493.641512 466.862691 20
# cld
# a
# a
# a
# b
N = 10000
k = 25
set.seed(25)
DT = data.table(X = rep(1:N, each = k), Y = rnorm(k*N))
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With