Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to call a function that returns multiple rows and columns in a data.table?

I want to call a function inside a data.table that calculates a set of summary statistics like the following:

summ.stats <- function(vec) {
    list(
         Min = min(vec),
         Mean = mean(vec),
         S.D. = sd(vec),
         Median = median(vec),
         Max = max(vec))
}

and I want to call it in the j of a data.table:

DT <- data.table(a=c(1,2,3,1,2,3),b=c(1,4,3,2,1,4),c=c(2,3,4,5,2,1))

DT[, summ.stats(b), by=a]

This is fine and I get:

   a Min Mean      S.D. Median Max
1: 1   1  1.5 0.7071068    1.5   2
2: 2   1  2.5 2.1213203    2.5   4
3: 3   3  3.5 0.7071068    3.5   4

But I am interested in passing multiple variables to summ.stats. For example:

DT[, summ.stats(b, c), by=a]

I want to get something like:

   a Var Min Mean      S.D. Median Max
1: 1   b   1  1.5 0.7071068    1.5   2
2: 2   b   1  2.5 2.1213203    2.5   4
3: 3   b   3  3.5 0.7071068    3.5   4
4: 1   c   2  3.5 2.1213203    3.5   5
5: 2   c   2  2.5 0.7071068    2.5   3
6: 3   c   1  2.5 2.1213203    2.5   4

What is the best way to do this?

like image 545
Rodrigo Avatar asked Jul 28 '13 22:07

Rodrigo


2 Answers

Alternatively you can modify your function as follows:

summ.stats <- function(vec) {
    list(
        Var = names(vec),
         Min = sapply(vec, min),
         Mean = sapply(vec, mean),
         S.D. = sapply(vec, sd),
         Median = sapply(vec, median),
         Max = sapply(vec, max))
}

DT[, summ.stats(.SD), by=a] # no need for as.list(.SD) as Roger mentions
   a Var Min Mean      S.D. Median Max
1: 1   b   1  1.5 0.7071068    1.5   2
2: 1   c   2  3.5 2.1213203    3.5   5
3: 2   b   1  2.5 2.1213203    2.5   4
4: 2   c   2  2.5 0.7071068    2.5   3
5: 3   b   3  3.5 0.7071068    3.5   4
6: 3   c   1  2.5 2.1213203    2.5   4
like image 116
Arun Avatar answered Oct 22 '22 05:10

Arun


Without explicitly reshaping to long form, you could do something like

rbindlist(lapply(c('b','c'), function(x) data.table(var = x, DT[,summ.stats(get(x)),by=a])))



#    var a Min Mean      S.D. Median Max
# 1:   b 1   1  1.5 0.7071068    1.5   2
# 2:   b 2   1  2.5 2.1213203    2.5   4
# 3:   b 3   3  3.5 0.7071068    3.5   4
# 4:   c 1   2  3.5 2.1213203    3.5   5
# 5:   c 2   2  2.5 0.7071068    2.5   3
# 6:   c 3   1  2.5 2.1213203    2.5   4

If you reshape the data to long form

reshape(DT, direction = 'long', 
            varying = list(value = c('b','c')), 
            times = c('b','c'))[,summ.stats(b), by = list(a, Var = time)]

will work as well.


Less efficiently you could use ldplyfrom plyr, with a slight redefinition of the function

summ.stats2 <- function(vec) {
    data.table(
         Min = min(vec),
         Mean = mean(vec),
         S.D. = sd(vec),
         Median = median(vec),
         Max = max(vec))
}
library(plyr)
DT[, ldply(lapply(.SD, summ.stats2)),by =a]
like image 28
mnel Avatar answered Oct 22 '22 04:10

mnel