I'm trying to achieve with data.table what dlply does in plyr. As a very simple example:
library(plyr)
library(data.table)
dt <- data.table( p = c("A", "B"), q = 1:2 )
dlply( dt, "p", identity )
$A
  p q
1 A 1

$B
  p q
1 B 2
dt[ , identity(.SD), by = p ]
   p q
1: A 1
2: B 2
foo <- function(x) as.list(x)
dt[ , foo(.SD), by = p ]
   p q
1: A 1
2: B 2
Obviously the return values of foo are collapsed into one data.table. And I don't want to use dlply, because it passes the split data.tables as data.frames to foo, which makes further data.table operations within foo inefficient.
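Essentially, what I'm after is something like the following sketch: split dt into a named list of data.tables without a data.frame round-trip (dl_dt is just a name I made up here, not an existing function):

dl_dt <- function(d, col) {
  setkeyv(d, col)             # key on the grouping column (modifies d by reference)
  groups <- unique(d[[col]])  # one entry per group, in key order
  lapply(setNames(nm = groups), function(g) d[.(g)])  # keyed join per group
}

dl_dt(dt, "p")$A
#    p q
# 1: A 1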
Here's a more data.table-oriented approach:
setkey(dt, p)
dt[, list(list(dt[J(.BY[[1]])])), by = p]$V1
#[[1]]
#   p q
#1: A 1
#
#[[2]]
#   p q
#1: B 2
There are more data.table-style alternatives to the above, but this one seems to be the fastest. Here's a comparison with lapply:
dt <- data.table( p = rep( LETTERS[1:25], 1E6), q = 25*1E6, key = "p" )  # 25e6 rows, 25 groups

library(microbenchmark)
microbenchmark(dt[, list(list(dt[J(.BY[[1]])])), by = p]$V1,
               lapply(unique(dt$p), function(x) dt[x]),
               times = 10)
#Unit: seconds
#                                          expr      min       lq   median       uq      max neval
#  dt[, list(list(dt[J(.BY[[1]])])), by = p]$V1 1.111385 1.508594 1.717357 1.966694 2.108188    10
#       lapply(unique(dt$p), function(x) dt[x]) 1.871054 1.934865 2.216192 2.282428 2.367505    10
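One caveat: unlike dlply, the $V1 list comes back unnamed. If named access matters, the names can be recovered from the grouping column of the same result (a small addition of mine, not part of the benchmark above):

res <- dt[, list(list(dt[J(.BY[[1]])])), by = p]  # keep the grouping column alongside V1
ll  <- setNames(res$V1, res$p)                    # name each chunk by its group value
ll[["A"]]                                         # dlply-style access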
Try this:
> split(dt, dt[["p"]])
$A
   p q
1: A 1

$B
   p q
1: B 2
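Side note: newer data.table releases (1.9.8 onwards, if I remember the version right) ship a split.data.table method, so you can also split by column name directly:

split(dt, by = "p")  # same named list as above; elements stay data.tables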
Regarding G. Grothendieck's answer, I was curious how well split performs:
dt <- data.table( p = rep( LETTERS[1:25], 1E6), q = 25*1E6, key = "p" )
system.time(
  ll <- split(dt, dt[, p])
)
   user  system elapsed
  5.237   1.340   6.563
system.time(
  ll <- lapply(unique(dt[, p]), function(x) dt[x])
)
   user  system elapsed
  1.179   0.363   1.541
So unless a better answer comes along, I'd stick with:
lapply( unique(dt[,p]), function(x) dt[x] )
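Presumably split is slower because it subsets each group through the generic data.frame path, while the keyed lapply lookup uses binary search on the key. If dlply-like names matter, the same idea works with named groups (a small tweak of mine, not from the original post):

ll <- lapply(setNames(nm = unique(dt[, p])), function(x) dt[x])  # names come from the group values
ll$A  # each element is a data.table, accessed like the dlply result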