Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to apply multiple functions to multiple columns within by? [duplicate]

Tags:

r

data.table

I am trying to apply multiple functions to multiple columns by a grouping variable. I can get the results, but not in a useful format. Below, I would like res2 to be an extension of res1 by the by variable "cyl" and as many rows as unique values of cyl.

I have tried omitting unlist and redefining the my.sum.function to return a numeric rather than a list. But I can't get the format I need.

library(data.table)

## The well known data 
data(mtcars)
DT <- data.table(mtcars)

## a custom set of summary functions
my.sum.fun = function(x){list(
    mean   = mean(x, na.rm=T),
    median = median(x, na.rm=T),
    sd     = sd(x, na.rm=T)
    )}

## I can summarize multiple columns. This works
res1 <- DT[,unlist(lapply(.SD,my.sum.fun)),.SDcols=c("mpg","hp")]
res1
 mpg.mean mpg.median     mpg.sd    hp.mean  hp.median      hp.sd 
 20.090625  19.200000   6.026948 146.687500 123.000000  68.562868 

## Now I add a by column. What I would like is the format as res1 but with the by column "cyl" added and with as many rows as unique values of "cyl".
res2 <- DT[,unlist(lapply(.SD,my.sum.fun)),.SDcols=c("mpg","hp"),by=list(cyl)]
res2
    cyl         V1
 1:   6  19.742857
 2:   6  19.700000
 3:   6   1.453567
 4:   6 122.285714
 5:   6 110.000000
 6:   6  24.260491
 7:   4  26.663636
 8:   4  26.000000
 9:   4   4.509828
10:   4  82.636364
11:   4  91.000000
12:   4  20.934530
13:   8  15.100000
14:   8  15.200000
15:   8   2.560048
16:   8 209.214286
17:   8 192.500000
18:   8  50.976886
like image 425
John Doe Avatar asked Jul 01 '19 19:07

John Doe


2 Answers

There is an option in unlist to avoid unlisting recursively - the recursive parameter (By default, the recursive = TRUE)

DT[,unlist(lapply(.SD,my.sum.fun), 
      recursive = FALSE),.SDcols=c("mpg","hp"),by=list(cyl)]
#   cyl mpg.mean mpg.median   mpg.sd   hp.mean hp.median    hp.sd
#1:   6 19.74286       19.7 1.453567 122.28571     110.0 24.26049
#2:   4 26.66364       26.0 4.509828  82.63636      91.0 20.93453
#3:   8 15.10000       15.2 2.560048 209.21429     192.5 50.97689
like image 163
akrun Avatar answered Sep 30 '22 12:09

akrun


I realize it may seem a little silly to use dplyr within data.table, but I don't think summarize_all is any slower than lapply and this will still let you take advantage of data table's fast grouping, etc.

library(dplyr)

my_funs <- list(
    mean   = function(x) mean(x, na.rm=T),
    median = function(x) median(x, na.rm=T),
    sd     = function(x) sd(x, na.rm=T)
  )

dt[, summarise_all(.SD, my_funs), .SDcols = c("mpg", "hp"), by = 'cyl']

#    cyl mpg_mean   hp_mean mpg_median hp_median   mpg_sd    hp_sd
# 1:   6 19.74286 122.28571       19.7     110.0 1.453567 24.26049
# 2:   4 26.66364  82.63636       26.0      91.0 4.509828 20.93453
# 3:   8 15.10000 209.21429       15.2     192.5 2.560048 50.97689
like image 36
IceCreamToucan Avatar answered Sep 30 '22 12:09

IceCreamToucan