I want to call a function inside a data.table that calculates a set of summary statistics like the following:
summ.stats <- function(vec) {
list(
Min = min(vec),
Mean = mean(vec),
S.D. = sd(vec),
Median = median(vec),
Max = max(vec))
}
and I want to call it in the j
of a data.table
:
DT <- data.table(a=c(1,2,3,1,2,3),b=c(1,4,3,2,1,4),c=c(2,3,4,5,2,1))
DT[, summ.stats(b), by=a]
This is fine and I get:
a Min Mean S.D. Median Max
1: 1 1 1.5 0.7071068 1.5 2
2: 2 1 2.5 2.1213203 2.5 4
3: 3 3 3.5 0.7071068 3.5 4
But I am interested in passing multiple variables to summ.stats. For example:
DT[, summ.stats(b, c), by=a]
I want to get something like:
a Var Min Mean S.D. Median Max
1: 1 b 1 1.5 0.7071068 1.5 2
2: 2 b 1 2.5 2.1213203 2.5 4
3: 3 b 3 3.5 0.7071068 3.5 4
4: 1 c 2 3.5 2.1213203 3.5 5
5: 2 c 2 2.5 0.7071068 2.5 3
6: 3 c 1 2.5 2.1213203 2.5 4
What is the best way to do this?
Alternatively you can modify your function as follows:
summ.stats <- function(vec) {
list(
Var = names(vec),
Min = sapply(vec, min),
Mean = sapply(vec, mean),
S.D. = sapply(vec, sd),
Median = sapply(vec, median),
Max = sapply(vec, max))
}
DT[, summ.stats(.SD), by=a] # no need for as.list(.SD) as Roger mentions
a Var Min Mean S.D. Median Max
1: 1 b 1 1.5 0.7071068 1.5 2
2: 1 c 2 3.5 2.1213203 3.5 5
3: 2 b 1 2.5 2.1213203 2.5 4
4: 2 c 2 2.5 0.7071068 2.5 3
5: 3 b 3 3.5 0.7071068 3.5 4
6: 3 c 1 2.5 2.1213203 2.5 4
Without explicitly reshaping to long form, you could do something like
rbindlist(lapply(c('b','c'), function(x) data.table(var = x, DT[,summ.stats(get(x)),by=a])))
# var a Min Mean S.D. Median Max
# 1: b 1 1 1.5 0.7071068 1.5 2
# 2: b 2 1 2.5 2.1213203 2.5 4
# 3: b 3 3 3.5 0.7071068 3.5 4
# 4: c 1 2 3.5 2.1213203 3.5 5
# 5: c 2 2 2.5 0.7071068 2.5 3
# 6: c 3 1 2.5 2.1213203 2.5 4
If you reshape
the data to long form
reshape(DT, direction = 'long',
varying = list(value = c('b','c')),
times = c('b','c'))[,summ.stats(b), by = list(a, Var = time)]
will work as well.
Less efficiently you could use ldply
from plyr, with a slight redefinition of the function
summ.stats2 <- function(vec) {
data.table(
Min = min(vec),
Mean = mean(vec),
S.D. = sd(vec),
Median = median(vec),
Max = max(vec))
}
library(plyr)
DT[, ldply(lapply(.SD, summ.stats2)),by =a]
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With