I am trying to apply multiple functions to multiple columns by a grouping variable. I can get the results, but not in a useful format. Below, I would like res2 to be an extension of res1 by the by variable "cyl" and as many rows as unique values of cyl.
I have tried omitting unlist and redefining the my.sum.function to return a numeric rather than a list. But I can't get the format I need.
library(data.table)
## The well known data
data(mtcars)
DT <- data.table(mtcars)
## a custom set of summary functions
my.sum.fun = function(x){list(
mean = mean(x, na.rm=T),
median = median(x, na.rm=T),
sd = sd(x, na.rm=T)
)}
## I can summarize multiple columns. This works
res1 <- DT[,unlist(lapply(.SD,my.sum.fun)),.SDcols=c("mpg","hp")]
res1
mpg.mean mpg.median mpg.sd hp.mean hp.median hp.sd
20.090625 19.200000 6.026948 146.687500 123.000000 68.562868
## Now I add a by column. What I would like is the format as res1 but with the by column "cyl" added and with as many rows as unique values of "cyl".
res2 <- DT[,unlist(lapply(.SD,my.sum.fun)),.SDcols=c("mpg","hp"),by=list(cyl)]
res2
cyl V1
1: 6 19.742857
2: 6 19.700000
3: 6 1.453567
4: 6 122.285714
5: 6 110.000000
6: 6 24.260491
7: 4 26.663636
8: 4 26.000000
9: 4 4.509828
10: 4 82.636364
11: 4 91.000000
12: 4 20.934530
13: 8 15.100000
14: 8 15.200000
15: 8 2.560048
16: 8 209.214286
17: 8 192.500000
18: 8 50.976886
There is an option in unlist
to avoid unlisting recursively - the recursive
parameter (By default, the recursive = TRUE
)
DT[,unlist(lapply(.SD,my.sum.fun),
recursive = FALSE),.SDcols=c("mpg","hp"),by=list(cyl)]
# cyl mpg.mean mpg.median mpg.sd hp.mean hp.median hp.sd
#1: 6 19.74286 19.7 1.453567 122.28571 110.0 24.26049
#2: 4 26.66364 26.0 4.509828 82.63636 91.0 20.93453
#3: 8 15.10000 15.2 2.560048 209.21429 192.5 50.97689
I realize it may seem a little silly to use dplyr within data.table, but I don't think summarize_all
is any slower than lapply
and this will still let you take advantage of data table's fast grouping, etc.
library(dplyr)
my_funs <- list(
mean = function(x) mean(x, na.rm=T),
median = function(x) median(x, na.rm=T),
sd = function(x) sd(x, na.rm=T)
)
dt[, summarise_all(.SD, my_funs), .SDcols = c("mpg", "hp"), by = 'cyl']
# cyl mpg_mean hp_mean mpg_median hp_median mpg_sd hp_sd
# 1: 6 19.74286 122.28571 19.7 110.0 1.453567 24.26049
# 2: 4 26.66364 82.63636 26.0 91.0 4.509828 20.93453
# 3: 8 15.10000 209.21429 15.2 192.5 2.560048 50.97689
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With