I am trying to apply multiple functions to multiple columns by a grouping variable. I can get the results, but not in a useful format. Below, I would like res2 to be an extension of res1 by the by variable "cyl" and as many rows as unique values of cyl.
I have tried omitting unlist and redefining the my.sum.function to return a numeric rather than a list. But I can't get the format I need.
library(data.table)
## The well known data 
data(mtcars)
DT <- data.table(mtcars)
## a custom set of summary functions
my.sum.fun = function(x){list(
    mean   = mean(x, na.rm=T),
    median = median(x, na.rm=T),
    sd     = sd(x, na.rm=T)
    )}
## I can summarize multiple columns. This works
res1 <- DT[,unlist(lapply(.SD,my.sum.fun)),.SDcols=c("mpg","hp")]
res1
 mpg.mean mpg.median     mpg.sd    hp.mean  hp.median      hp.sd 
 20.090625  19.200000   6.026948 146.687500 123.000000  68.562868 
## Now I add a by column. What I would like is the format as res1 but with the by column "cyl" added and with as many rows as unique values of "cyl".
res2 <- DT[,unlist(lapply(.SD,my.sum.fun)),.SDcols=c("mpg","hp"),by=list(cyl)]
res2
    cyl         V1
 1:   6  19.742857
 2:   6  19.700000
 3:   6   1.453567
 4:   6 122.285714
 5:   6 110.000000
 6:   6  24.260491
 7:   4  26.663636
 8:   4  26.000000
 9:   4   4.509828
10:   4  82.636364
11:   4  91.000000
12:   4  20.934530
13:   8  15.100000
14:   8  15.200000
15:   8   2.560048
16:   8 209.214286
17:   8 192.500000
18:   8  50.976886
                There is an option in unlist to avoid unlisting recursively - the recursive parameter  (By default, the recursive = TRUE)
DT[,unlist(lapply(.SD,my.sum.fun), 
      recursive = FALSE),.SDcols=c("mpg","hp"),by=list(cyl)]
#   cyl mpg.mean mpg.median   mpg.sd   hp.mean hp.median    hp.sd
#1:   6 19.74286       19.7 1.453567 122.28571     110.0 24.26049
#2:   4 26.66364       26.0 4.509828  82.63636      91.0 20.93453
#3:   8 15.10000       15.2 2.560048 209.21429     192.5 50.97689
                        I realize it may seem a little silly to use dplyr within data.table, but I don't think summarize_all is any slower than lapply and this will still let you take advantage of data table's fast grouping, etc.
library(dplyr)
my_funs <- list(
    mean   = function(x) mean(x, na.rm=T),
    median = function(x) median(x, na.rm=T),
    sd     = function(x) sd(x, na.rm=T)
  )
dt[, summarise_all(.SD, my_funs), .SDcols = c("mpg", "hp"), by = 'cyl']
#    cyl mpg_mean   hp_mean mpg_median hp_median   mpg_sd    hp_sd
# 1:   6 19.74286 122.28571       19.7     110.0 1.453567 24.26049
# 2:   4 26.66364  82.63636       26.0      91.0 4.509828 20.93453
# 3:   8 15.10000 209.21429       15.2     192.5 2.560048 50.97689
                        If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With