I'm trying to simplify this data.table two-stage process which acts on both numeric and character variables. E.g. - take the first element of textvar and sum each of the numeric variables. Consider this small example:
library(data.table)
dt <- data.table(grpvar=letters[c(1,1,2)], textvar=c("one","two","one"),
numvar=1:3, othernum=2:4)
dt
# grpvar textvar numvar othernum
#1: a one 1 2
#2: a two 2 3
#3: b one 3 4
Now my first thought was to nest .SD to drop the one variable out of the lapply call, but I thought that was a bit complicated:
dt[, c(textvar=textvar[1], .SD[, lapply(.SD, sum), .SDcols=-c("textvar")]), by=grpvar]
# grpvar textvar numvar othernum
#1: a one 3 5
#2: b one 3 4
Then I thought maybe I could do each grouping separately and join them, but that seems even worse:
dt[, .(textvar=textvar[1]), by=grpvar][
dt[, lapply(.SD, sum), by=grpvar, .SDcols=-c("textvar")], on="grpvar"
]
# grpvar textvar numvar othernum
#1: a one 3 5
#2: b one 3 4
Is there a simpler construction that would get around the nesting of .SD or the joining? I feel like I'm overlooking something elementary.
The j-argument in data.table is (deliberately) quite flexible. All we need to remember is that:
As long as
jreturns a list, each element of the list will become a column in the resulting data.table.
Using the fact that c(list, list) is a list, we can construct the expression as follows:
dt[, c(textvar = textvar[1L], lapply(.SD, sum)), # select/compute all cols necessary
.SDcols = numvar:othernum, # provide .SD's columns
by = grpvar] # group by 'grpvar'
# grpvar textvar numvar othernum
# 1: a one 3 5
# 2: b one 3 4
Here, I've not wrapped the first expression with list() since textvar[1L] returns a length=1 vector.. i.e., identical(c(1, list(2, 3)), c(list(1), list(2,3))) is TRUE.
Note that this is only possible from v1.9.7. The bug was just recently fixed in the current development version.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With