
Equivalent for dlply in data.table

Tags: r, data.table

I am trying to achieve with data.table what dlply does in plyr. Here is a very simple example:

library(plyr)
library(data.table)
dt <- data.table( p = c("A", "B"), q = 1:2 )

dlply( dt, "p", identity )
$A
  p q
1 A 1

$B
  p q
1 B 2

dt[ , identity(.SD), by = p ]
   p q
1: A 1
2: B 2

foo <- function(x) as.list(x)
dt[ , foo(.SD), by = p ]
   p q
1: A 1
2: B 2

As expected, the return values of foo are collapsed into a single data.table. I don't want to use dlply because it passes the split data.tables to foo as data.frames, which makes further data.table operations inside foo inefficient.

Beasterfield, asked May 22 '13 09:05


3 Answers

Here's a more data.table oriented approach:

setkey(dt, p)
dt[, list(list(dt[J(.BY[[1]])])), by = p]$V1
#[[1]]
#   p q
#1: A 1
#
#[[2]]
#   p q
#1: B 2

There are other data.table-style alternatives to the above, but this one seems to be the fastest. Here's a comparison with lapply:

library(microbenchmark)
dt <- data.table( p = rep( LETTERS[1:25], 1E6), q = 25*1E6, key = "p" )
microbenchmark(dt[, list(list(dt[J(.BY[[1]])])), by = p]$V1, lapply(unique(dt$p), function(x) dt[x]), times = 10)
#Unit: seconds
#                                        expr      min       lq   median       uq      max neval
#dt[, list(list(dt[J(.BY[[1]])])), by = p]$V1 1.111385 1.508594 1.717357 1.966694 2.108188    10
#     lapply(unique(dt$p), function(x) dt[x]) 1.871054 1.934865 2.216192 2.282428 2.367505    10
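
Unlike the dlply output in the question, the list returned by the keyed-join approach is unnamed. The following is a minimal sketch, not part of the original answer, of one way to attach the group values as names. The column name sub and the variables res and ll are made up for illustration.

```r
library(data.table)

# Same toy data as in the question, keyed on the grouping column
dt <- data.table(p = c("A", "B"), q = 1:2)
setkey(dt, p)

# One row per group; the list column 'sub' (illustrative name) holds
# the subset of dt that belongs to that group
res <- dt[, list(sub = list(dt[J(.BY[[1]])])), by = p]

# Name each list element after its group value, as dlply does
ll <- setNames(res$sub, res$p)
```

With this, ll$A and ll$B are data.tables, so further data.table operations can be applied to them directly.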
eddi, answered Nov 05 '22 05:11


Try this:

> split(dt, dt[["p"]])
$A
   p q
1: A 1

$B
   p q
1: B 2
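
As a side note, more recent data.table releases ship their own split method for data.tables, which accepts a by argument. The sketch below assumes such a version is installed. The names parts and sums are illustrative only.

```r
library(data.table)

dt <- data.table(p = c("A", "B"), q = 1:2)

# split.data.table: split on the column(s) named in 'by';
# the result is a named list of data.tables
parts <- split(dt, by = "p")

# Each piece is still a data.table, so data.table syntax works per group
sums <- lapply(parts, function(d) d[, sum(q)])
```

Whether this is faster than the keyed lapply approach on large data would have to be measured; no timing is claimed here.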
G. Grothendieck, answered Nov 05 '22 06:11


Regarding G. Grothendieck's answer, I was curious how well split performs:

dt <- data.table( p = rep( LETTERS[1:25], 1E6), q = 25*1E6, key = "p" )

system.time(
  ll <- split(dt, dt[ ,p ] )
)
  user  system elapsed 
  5.237   1.340   6.563 

system.time(
  ll <- lapply( unique(dt[,p]), function(x) dt[x] )
)
  user  system elapsed 
  1.179   0.363   1.541

So if there is no better answer, I'd stick with

lapply( unique(dt[,p]), function(x) dt[x] )
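
Note that dt[x] with a character x relies on dt being keyed on p, and the resulting list has no names. A small sketch of a variant that names the elements and spells out the join column with on = (so it does not depend on the key) could look like this. The variable names keys and ll are illustrative.

```r
library(data.table)

dt <- data.table(p = c("A", "B"), q = 1:2, key = "p")

# Distinct group values, in key order
keys <- unique(dt[, p])

# Subset once per group; 'on' makes the join column explicit
ll <- lapply(keys, function(x) dt[.(x), on = "p"])

# Attach the group values as names, mirroring dlply's output
names(ll) <- keys
```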
Beasterfield, answered Nov 05 '22 06:11