
retaining column names in lapply(.SD,...) for data.table R

Tags: r, data.table

When applying a function with multiple output variables (e.g., a list) to a subset of a data.table, I lose the variable names. Is there a way to retain them?

library(data.table)

foo <- function(x){
  list(mn = mean(x), sd = sd(x))
}

bar <- data.table(x=1:8, y=c("d","e","f","g"))

# column names "mn" and "sd" are replaced by "V1" and "V2"
bar[, sapply(.SD, foo), by = y, .SDcols="x"]

# column names "mn" and "sd" are retained
bar_split <- split(bar$x, bar$y)
t(sapply(bar_split, foo))
asked Apr 27 '15 21:04 by Bryan


2 Answers

The setNames function lets you restore the missing names by supplying them as a character vector:

bar[, setNames( sapply(.SD, foo), c("mn", "sd")), by = y, .SDcols="x"]
#    y mn       sd
# 1: d  3 2.828427
# 2: e  4 2.828427
# 3: f  5 2.828427
# 4: g  6 2.828427
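
As a rough illustration of where the names go (a base-R sketch, not a description of data.table internals): sapply simplifies the output of foo into a list-matrix whose labels are stored as dimnames rather than as names, so names() returns NULL until setNames adds them back. The input list below is made up for the example and mirrors group "d".

```r
foo <- function(x) list(mn = mean(x), sd = sd(x))

# Hypothetical stand-in for .SD within group "d" (x = 1 and 5)
res <- sapply(list(x = c(1, 5)), foo)

dim(res)        # 2 rows (mn, sd) by 1 column (x)
rownames(res)   # "mn" "sd"
names(res)      # NULL: the labels are dimnames, not names

# setNames attaches a names attribute to the same object
named <- setNames(res, c("mn", "sd"))
names(named)    # "mn" "sd"
```

This probably explains the V1/V2 defaults, but treat that link as an assumption.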

The authors suggested using the other form, the one proposed by Arenburg:

DT[, c('x2', 'y2') := list(x / sum(x), y / sum(y)), by = grp]
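
To make that line runnable, here is a minimal sketch with a made-up table; DT, grp, x and y are hypothetical and not taken from the question. Note that `:=` adds the named columns by reference and keeps every row, so it does not collapse each group to a single summary row the way the aggregation above does.

```r
library(data.table)

# Hypothetical data: two groups with two rows each
DT <- data.table(grp = c("a", "a", "b", "b"),
                 x   = c(1, 3, 2, 6),
                 y   = c(10, 30, 5, 15))

# The names on the left-hand side are assigned to the list elements on the right
DT[, c("x2", "y2") := list(x / sum(x), y / sum(y)), by = grp]

names(DT)   # "grp" "x" "y" "x2" "y2"
```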
answered Sep 20 '22 08:09 by IRTFM


I would go with the following, which is a bit awkward but doesn't require writing the names manually, no matter how many functions there are:

bar[, as.list(unlist(lapply(.SD, foo))), by = y, .SDcols = "x"]
#    y x.mn     x.sd
# 1: d    3 2.828427
# 2: e    4 2.828427
# 3: f    5 2.828427
# 4: g    6 2.828427

The biggest advantage of this approach is that it binds the function names to the column names. If, for example, you had an additional column, the same code as above would still give an informative result:

set.seed(1)
bar[, z := sample(8)]
bar[, as.list(unlist(lapply(.SD, foo))), by = y, .SDcols = c("x", "z")]
#    y x.mn     x.sd z.mn      z.sd
# 1: d    3 2.828427  2.0 1.4142136
# 2: e    4 2.828427  7.5 0.7071068
# 3: f    5 2.828427  3.0 1.4142136
# 4: g    6 2.828427  5.5 0.7071068
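
The combined names come from unlist, which joins the outer (column) name and the inner (function output) name with a dot. A small base-R sketch of that step, using a made-up list in place of .SD (the x values mirror group "d"; the z values are arbitrary):

```r
foo <- function(x) list(mn = mean(x), sd = sd(x))

# Hypothetical stand-in for .SD with two columns
cols <- list(x = c(1, 5), z = c(2, 4))

res <- unlist(lapply(cols, foo))
names(res)    # "x.mn" "x.sd" "z.mn" "z.sd"

# as.list turns the named vector into a named list, which becomes one column per element
as.list(res)
```

One caveat: unlist coerces everything to a common type, so this works most cleanly when every output of foo is numeric.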
answered Sep 18 '22 08:09 by David Arenburg