R Dynamically build "list" in data.table (or ddply)

My aggregation needs vary among columns / data.frames. I would like to pass the "list" argument to the data.table dynamically.

As a minimal example:

require(data.table)
type <- c(rep("hello", 3), rep("bye", 3), rep("ok",3))
a <- (rep(1:3, 3))
b <- runif(9)
c <- runif(9)
df <- data.frame(cbind(type, a, b, c), stringsAsFactors=F)
DT <-data.table(df)

This call:

DT[, list(suma = sum(as.numeric(a)), meanb = mean(as.numeric(b)), minc = min(as.numeric(c))), by= type]

gives a result similar to this:

    type suma     meanb      minc
1: hello    6 0.1332210 0.4265579
2:   bye    6 0.5680839 0.2993667
3:    ok    6 0.5694532 0.2069026

Future data.frames will have more columns that I will want to summarize differently. But for the sake of working with this small example: is there a way to pass the list programmatically?

I naïvely tried:

# create a different list
mylist <- "list(lengtha = length(as.numeric(a)), maxb = max(as.numeric(b)), meanc = mean(as.numeric(c)))"
# new call
DT[, mylist, by=type]

This does not aggregate anything; it just returns the string itself for every group:

1: hello
2:   bye
3:    ok
mylist
1: list(lengtha = length(as.numeric(a)), maxb = max(as.numeric(b)), meanc = mean(as.numeric(c)))
2: list(lengtha = length(as.numeric(a)), maxb = max(as.numeric(b)), meanc = mean(as.numeric(c)))
3: list(lengtha = length(as.numeric(a)), maxb = max(as.numeric(b)), meanc = mean(as.numeric(c)))

Any hints appreciated! Best regards!

PS: Sorry about the as.numeric() calls. I could not quite figure out why, but I needed them for the example to run.

Minor edit: inserted "columns /" before "data.frames" in the initial sentence to clarify my needs.
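A likely explanation for the as.numeric() calls: cbind() on a character vector and numeric vectors builds a character matrix, so every column of df ends up as character. A minimal sketch of the effect, where the names m and df2 are only for illustration and not part of the original example:

```r
type <- c(rep("hello", 3), rep("bye", 3), rep("ok", 3))
a <- rep(1:3, 3)

# cbind() of a character vector and an integer vector gives a character matrix
m <- cbind(type, a)
is.character(m)

# Passing the vectors straight to data.frame() keeps each column's own type,
# so as.numeric() would no longer be needed
df2 <- data.frame(type, a, stringsAsFactors = FALSE)
is.integer(df2$a)
```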

jjap asked Feb 06 '13 04:02


2 Answers

This is explained in data.table FAQ 1.6: what you are looking for is quote and eval.

Something like:

 mycall <- quote(list(lengtha = length(as.numeric(a)), maxb = max(as.numeric(b)), meanc = mean(as.numeric(c))))

 DT[, eval(mycall), by = type]
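If the aggregations are not known in advance, the quoted call can itself be assembled from data. The following is a minimal sketch, not the only way to do it: the spec list and its layout (output name mapped to a function name and a column name) are assumptions made up for this illustration, and the columns are plain numerics so no as.numeric() is needed.

```r
library(data.table)

DT <- data.table(
  type = rep(c("hello", "bye", "ok"), each = 3),
  a = rep(1:3, 3),
  b = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9),
  c = c(0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1)
)

# Hypothetical spec: output column name -> c(function name, input column)
spec <- list(
  lengtha = c("length", "a"),
  maxb    = c("max", "b"),
  meanc   = c("mean", "c")
)

# Turn each entry into an unevaluated call such as length(a)
exprs <- lapply(spec, function(s) as.call(list(as.name(s[1]), as.name(s[2]))))

# Wrap the calls in list(...); the names of spec become the output names
mycall <- as.call(c(as.name("list"), exprs))

res <- DT[, eval(mycall), by = type]
```

Printing mycall should show the same kind of expression that quote() would have captured, which makes it easy to check what will be evaluated before running it.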

After a bit of head-banging, here is a very ugly way of constructing the call for ddply using .()

library(plyr)

myplyrcall <- .(lengtha = length(as.numeric(a)), maxb = max(as.numeric(b)), meanc = mean(as.numeric(c)))

# df is the data.frame from the question
do.call(ddply, c(.data = quote(df), .variables = 'type', .fun = quote(summarise), myplyrcall))

You could also use plyr's as.quoted, which has an as.quoted.character method, to construct the call from strings (built with paste0, for example):

myplc <-as.quoted(c("lengtha" = "length(as.numeric(a))", "maxb" = "max(as.numeric(b))", "meanc" = "mean(as.numeric(c))"))

This can be used with data.table as well!

dtcall <- as.quoted(mylist)[[1]]


DT[,eval(dtcall), by = type]
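If plyr is not available, base R's parse() can do the same job for a single string. This is a sketch under the assumption that the string comes from a trusted source, since parsing and evaluating text runs whatever code it contains; the data here are made up so the block runs on its own.

```r
library(data.table)

DT <- data.table(
  type = rep(c("hello", "bye", "ok"), each = 3),
  a = rep(1:3, 3),
  b = as.numeric(1:9),
  c = as.numeric(9:1)
)

# The string from the question
mylist <- "list(lengtha = length(as.numeric(a)), maxb = max(as.numeric(b)), meanc = mean(as.numeric(c)))"

# parse() returns an expression vector; [[1]] extracts the call inside it
dtcall <- parse(text = mylist)[[1]]

res <- DT[, eval(dtcall), by = type]
```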

data.table all the way.

mnel answered Sep 18 '22 05:09


Another way is to use .SDcols to group together the columns on which you'd like to perform the same operation. Let's say that you need columns a, d, e to be summed by type, whereas b, g should have their mean taken and c, f their median. Then:

# constructing an example data.table:
set.seed(45)
dt <- data.table(type=rep(c("hello","bye","ok"), each=3), a=sample(9), 
                 b = rnorm(9), c=runif(9), d=sample(9), e=sample(9), 
                 f = runif(9), g=rnorm(9))

#     type a          b         c d e         f          g
# 1: hello 6 -2.5566166 0.7485015 9 6 0.5661358 -2.2066521
# 2: hello 3  1.1773119 0.6559926 3 3 0.4586280 -0.8376586
# 3: hello 2 -0.1015588 0.2164430 1 7 0.9299597  1.7216593
# 4:   bye 8 -0.2260640 0.3924327 8 2 0.1271187  0.4360063
# 5:   bye 7 -1.0720503 0.3256450 7 8 0.5774691  0.7571990
# 6:   bye 5 -0.7131021 0.4855804 6 9 0.2687791  1.5398858
# 7:    ok 1 -0.4680549 0.8476840 2 4 0.5633317  1.5393945
# 8:    ok 4  0.4183264 0.4402595 4 1 0.7592801  2.1829996
# 9:    ok 9 -1.4817436 0.5080116 5 5 0.2357030 -0.9953758

# 1) set key
setkey(dt, "type")

# 2) group col-ids by similar operations
id1 <- which(names(dt) %in% c("a", "d", "e"))
id2 <- which(names(dt) %in% c("b","g"))
id3 <- which(names(dt) %in% c("c","f"))

# 3) now use these ids with the .SDcols parameter
dt1 <- dt[, lapply(.SD, sum), by="type", .SDcols=id1]
dt2 <- dt[, lapply(.SD, mean), by="type", .SDcols=id2]
dt3 <- dt[, lapply(.SD, median), by="type", .SDcols=id3]

# 4) merge them.
dt1[dt2[dt3]]

#     type  a  d  e          b          g         c         f
# 1:   bye 20 21 19 -0.6704055  0.9110304 0.3924327 0.2687791
# 2: hello 11 13 16 -0.4936211 -0.4408838 0.6559926 0.5661358
# 3:    ok 14 11 10 -0.5104907  0.9090061 0.5080116 0.5633317

If/when you have many, many columns, writing out a list like the one you have by hand might become cumbersome.
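The four steps above can also be driven by a small lookup, so that adding columns only means editing one list. This is a rough sketch: the spec list (function name mapped to column names) is a made-up convention for this illustration, the data are small fixed values rather than the random ones above, and merge() is used in place of the keyed join.

```r
library(data.table)

dt <- data.table(
  type = rep(c("hello", "bye", "ok"), each = 3),
  a = 1:9,
  b = as.numeric(1:9),
  c = c(3, 1, 2, 6, 4, 5, 9, 7, 8)
)

# Hypothetical spec: function name -> columns it should be applied to
spec <- list(sum = "a", mean = "b", median = "c")

# One grouped summary per function, restricted to its columns via .SDcols
parts <- lapply(names(spec), function(fn) {
  f <- match.fun(fn)
  dt[, lapply(.SD, f), by = type, .SDcols = spec[[fn]]]
})

# Merge the partial results on the grouping column
res <- Reduce(function(x, y) merge(x, y, by = "type"), parts)
```

Because merge() sorts on the by column, the rows of res come back ordered by type rather than in order of first appearance.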

Arun answered Sep 22 '22 05:09