A basic property of data.table is that "As long as j returns a list, each element of the list becomes a column in the resulting data.table."
This is shown e.g. in an example from ?data.table:
library(data.table)
DT[, c(.N, lapply(.SD, sum)), by=x]
Here the integer .N is concatenated with the list resulting from lapply, and the overall result is a list, i.e. .N is implicitly coerced to a list element (according to the coercion hierarchy described in ?c).
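The coercion can be seen in base R alone, without data.table. This is only a sketch: n and sums are made-up stand-ins for .N and for the result of lapply(.SD, sum).

```r
# Base R only: how c() combines an atomic value with a list.
n    <- 5L                        # hypothetical stand-in for .N
sums <- list(x = 15L, y = 20L)    # hypothetical stand-in for lapply(.SD, sum)

implicit <- c(n, sums)            # the integer is promoted to a list element
explicit <- c(list(n), sums)      # the integer is wrapped in a list first

str(implicit)                     # shows the structure of the combined list
is.list(implicit)                 # TRUE
identical(implicit, explicit)     # TRUE
```

Outside of data.table, then, wrapping the atomic value in list() before calling c() makes no difference to the result.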
What caught my attention was the example used in both ?data.table and .SD where, in contrast to above, the 'non-list' part of j is explicitly converted to a list:
DT[, c(.(y=max(y)), lapply(.SD, min)), by=rleid(v), .SDcols=v:b]
It is not immediately obvious to me why the single number resulting from max(y) in this example is wrapped in a list (.()) when it will be coerced to a list element anyway, following the concatenation with a list (lapply(.SD, min)).

list not needed?

Here's a small example where max(y) and the sum of each variable are calculated by a grouping variable. The structure and the result of the calculations are indeed the same, both when explicitly converting the 'non-list' result in j to a list, and when not doing so:
dt <- data.table(grp = rep(c("a", "b"), 2:3), x = 1:5, y = 2:6)
# - structure of j
dt[ , str(c(.(ymax = max(y)), lapply(.SD, sum))), by = grp]
dt[ , str(c(ymax = max(y), lapply(.SD, sum))), by = grp]
# - result of j
dt[ , c(.(ymax = max(y)), lapply(.SD, sum)), by = grp]
dt[ , c(ymax = max(y), lapply(.SD, sum)), by = grp]
# ...both give the same result:
# grp ymax x y
# 1: a 3 3 5
# 2: b 6 12 15
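Rather than comparing printed output by eye, the two results can be compared programmatically. This is a minimal sketch; the names with_list and without_list are introduced here for illustration only.

```r
library(data.table)

dt <- data.table(grp = rep(c("a", "b"), 2:3), x = 1:5, y = 2:6)

# j with the 'non-list' part explicitly wrapped in .()
with_list <- dt[, c(.(ymax = max(y)), lapply(.SD, sum)), by = grp]

# j with the 'non-list' part left as a bare named value
without_list <- dt[, c(ymax = max(y), lapply(.SD, sum)), by = grp]

# Compare column names and contents of the two results
names(with_list)
names(without_list)
all.equal(with_list, without_list)
```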
Thus, in these examples, explicitly converting the non-list part of j to a list seems redundant. So is it really needed? Perhaps the next example with .N provides a clue.

list needed with .N!

In this case the 'non-list' part of j is .N; otherwise j is the same as above.
The structure of j is the same, both with and without explicit conversion of .N to a list in j:
dt[ , str(c(.(n = .N), lapply(.SD, sum))), by = grp]
dt[ , str(c(n = .N, lapply(.SD, sum))), by = grp]
# List of 3
# $ n: int 2
# $ x: int 3
# $ y: int 5
# List of 3
# $ n: int 3
# $ x: int 12
# $ y: int 15
~~> Note that according to str, in both cases the ".N variable" has the name which was set in j, "n".
However, if .N is not explicitly "listed" in j, the name of .N remains the default "N" (see ?.N) in the results:
dt[ , c(.(n = .N), lapply(.SD, sum)), by = grp]
# grp n x y
# 1: a 2 3 5
# 2: b 3 12 15
dt[ , c(n = .N, lapply(.SD, sum)), by = grp]
# grp N x y
# 1: a 2 3 5
# 2: b 3 12 15
Of course my question is not about trying to avoid typing the three characters .(), but to understand the fundamentals of how j can/should be specified.
Is the default name of .N (as shown above) the exception which warrants the use of explicit list, always, just to be on the safe side? Are there other pitfalls which I have overlooked?
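One way to stay on the safe side, sketched below, is to inspect the column names of the result and, if needed, rename by position afterwards. The names res_listed and res_bare are introduced here for illustration, and the name that the bare .N column receives may differ between data.table versions, so no output is asserted for it.

```r
library(data.table)

dt <- data.table(grp = rep(c("a", "b"), 2:3), x = 1:5, y = 2:6)

# Explicit list: the name "n" set in j is kept, as shown above.
res_listed <- dt[, c(.(n = .N), lapply(.SD, sum)), by = grp]
names(res_listed)

# Bare .N: above, this column came back as "N" rather than "n";
# this may be version dependent.
res_bare <- dt[, c(n = .N, lapply(.SD, sum)), by = grp]
names(res_bare)

# Assumed fallback: rename by position (column 2, right after grp), so the
# code does not depend on which name j produced for the count column.
setnames(res_bare, 2L, "n")
names(res_bare)
```

Renaming by position is a workaround rather than an explanation of the behaviour. Wrapping .N in .() remains the more direct way to get the intended name.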
First, thanks to @Frank for pointing out that the behaviour of the name of .N might be a bug. Although it is a minor issue in itself (I posted one though), it indeed seems to have been the cause of my more general confusion about the use of list in j.
I made some further simple tests on .N and its name, which suggest some potential inconsistencies. FWIW, I thought I might just as well share them here (too long for a comment).

.N in result

1a: .N is autonamed 'N' when using (1) .N only, (2) .N with by, or (3) .N with lapply and by.
1b: .N is autonamed 'V1' instead of 'N' when using (1) .N with lapply, or (2) list(.N) with lapply.
# 1a: .N is autonamed 'N'
# .N only
dt[ , .(.N)]
# N
# 1: 5
# .N + by
dt[ , .(.N), by = grp]
# grp N
# 1: a 2
# 2: b 3
# .N + lapply + by
dt[ , c(.N, lapply(.SD, max)), by = grp]
# grp N x y
# 1: a 2 2 3
# 2: b 3 5 6
# 1b: .N is autonamed V1, instead of N
# .N + lapply
dt[ , c(.N, lapply(.SD, max))]
# V1 grp x y
# 1: 5 b 5 6
# list(.N) + lapply
dt[ , c(.(.N), lapply(.SD, max))]
# V1 grp x y
# 1: 5 b 5 6
.N

2a: list(.N) is not needed to set the name of .N when using .N with lapply (no by).
2b: list(.N) is needed to set the name of .N when using .N with lapply and by.
# 2a: list(.N) not needed
# .N + lapply
dt[ , c(n = .N, lapply(.SD, max))]
# n grp x y
# 1: 5 b 5 6
# 2b: list(.N) needed
# .N + lapply + by
dt[ , c(.(n = .N), lapply(.SD, max)), by = grp]
# grp n x y
# 1: a 2 2 3
# 2: b 3 5 6