I need to programmatically apply different functions to different columns and group by, using data.table
.
If the columns and functions were known, I would do like this:
library(data.table)
DT = data.table(id = rep(letters[1:3], each=3),
v1 = rep(c(2, 3, 4), each=3),
v2 = rep(c(5, 10, 15), each=3))
DT
#> id v1 v2
#> 1: a 2 5
#> 2: a 2 5
#> 3: a 2 5
#> 4: b 3 10
#> 5: b 3 10
#> 6: b 3 10
#> 7: c 4 15
#> 8: c 4 15
#> 9: c 4 15
DT[, .(v1=mean(v1), v2=sum(v2)), keyby=.(id)]
#> id v1 v2
#> 1: a 2 15
#> 2: b 3 30
#> 3: c 4 45
But I want to do this by passing the column names and their specific function:
aggregate_functions = list(v1=mean, v2=sum)
col_selection = c('v1', 'v2')
I wrote something like this by I can't figure out a way of passing the column name to lapply
:
DT[, lapply(.SD,
aggregate_functions[[col_name]] # some way of selecting the right function from aggregate_functions
),
.SDcols = col_selection,
by=id]
I have also tried with melt
and dcast
, but the latter applies all the functions to all the columns:
library(data.table)
DT = data.table(id = rep(letters[1:3], each=3),
v1 = rep(c(2, 3, 4), each=3),
v2 = rep(c(5, 10, 15), each=3))
DTm = melt(DT, meaure.vars=col_selection, id.vars='id')
DTm
#> id variable value
#> 1: a v1 2
#> 2: a v1 2
#> 3: a v1 2
#> 4: b v1 3
#> 5: b v1 3
#> 6: b v1 3
#> 7: c v1 4
#> 8: c v1 4
#> 9: c v1 4
#> 10: a v2 5
#> 11: a v2 5
#> 12: a v2 5
#> 13: b v2 10
#> 14: b v2 10
#> 15: b v2 10
#> 16: c v2 15
#> 17: c v2 15
#> 18: c v2 15
DTc = dcast(DTm, id~variable, fun.aggregate=list(sum, mean))
DTc
#> id value_sum_v1 value_sum_v2 value_mean_v1 value_mean_v2
#> 1: a 6 15 2 5
#> 2: b 9 30 3 10
#> 3: c 12 45 4 15
I could programmatically select and rename the relevant columns (3 and 4 in this case) but it doesn't look like an efficient approach.
Of course I could have a loop doing the job and merging the results, but I am looking for a data.table
way.
Thank you for your answer and thank you to the team at data.table
.
Created on 2019-11-26 by the reprex package (v0.3.0)
An option is to use mapply
:
DT[, mapply(function(f,x) as.list(f(x)), aggregate_functions, .SD), id,
.SDcols=col_selection]
Need to careful on the ordering of col_selection
and aggregate_functions
so that the right function is applied to the right column.
output:
id v1 v2
1: a 2 15
2: b 3 30
3: c 4 45
Edit from the OP:
Just to complete this brilliant solution.
This solution works very well and if we replace col_selection
with names(aggregate_functions)
there is no issue with the ordering. Plus it automatically discards all the columns that are not in the list:
library(data.table)
DT = data.table(id = rep(letters[1:3], each=3),
v1 = rep(c(2, 3, 4), each=3),
v2 = rep(c(5, 10, 15), each=3),
id2 = c(rep(c('cc', 'dd'), 4), 'dd')
)
aggregate_functions = list(v1=mean, v2=sum)
DT[, mapply(function(f,x) as.list(f(x)), aggregate_functions, .SD), id,
.SDcols=names(aggregate_functions)]
#> id v1 v2
#> 1: a 2 15
#> 2: b 3 30
#> 3: c 4 45
It is also possible to use multiple variables to aggregate by, by passing a list:
DT[, mapply(function(f,x) as.list(f(x)), aggregate_functions, .SD), list(id, id2),
.SDcols=names(aggregate_functions)]
#> id id2 v1 v2
#> 1: a cc 2 10
#> 2: a dd 2 5
#> 3: b dd 3 20
#> 4: b cc 3 10
#> 5: c cc 4 15
#> 6: c dd 4 30
Created on 2019-11-27 by the reprex package (v0.3.0)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With