
Do we need to convert single elements of j to a list when the overall result of j is a list anyway?

Tags: r, data.table

Background

A basic property of data.table is that

"As long as j returns a list, each element of the list becomes a column in the resulting data.table."

This is shown e.g. in an example from ?data.table:

library(data.table)
DT[, c(.N, lapply(.SD, sum)), by=x]

Here the integer .N is concatenated with the list returned by lapply, and the overall result is a list; that is, .N is implicitly coerced to a list element (according to the coercion hierarchy described in ?c).
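The coercion itself can be checked outside data.table. A minimal base R sketch, where n merely stands in for .N and sums for the result of lapply(.SD, sum) (both names and their values are made up for illustration):

```r
# Base R only: c() of an atomic value and a list returns a list
n    <- 2L                    # stand-in for .N (illustrative value)
sums <- list(x = 3L, y = 5L)  # stand-in for lapply(.SD, sum)

res <- c(n, sums)

is.list(res)  # TRUE
length(res)   # 3
names(res)    # "" "x" "y"  (the unnamed integer gets an empty name)
```

This only shows what c() does on its own; how data.table then names the columns is a separate step.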

What caught my attention was the example used in both ?data.table and .SD where, in contrast to above, the 'non-list' part of j is explicitly converted to a list:

DT[, c(.(y=max(y)), lapply(.SD, min)), by=rleid(v), .SDcols=v:b]

It is not immediately obvious to me why the single number resulting from y=max(y) in this example is explicitly converted to a list with .() when it would become a list element anyway once it is concatenated with a list (lapply(.SD, ...)).


Explicit list not needed?

Here's a small example where max(y) and sum of all variables are calculated by a grouping variable. The structure and the result of the calculations are indeed the same, both when explicitly converting the 'non-list' result in j to a list, and when not doing so:

dt <- data.table(grp = rep(c("a", "b"), 2:3), x = 1:5, y = 2:6)

# - structure of j
dt[ , str(c(.(ymax = max(y)), lapply(.SD, sum))), by = grp]
dt[ , str(c(ymax = max(y), lapply(.SD, sum))), by = grp]

# - result of j
dt[ , c(.(ymax = max(y)), lapply(.SD, sum)), by = grp]
dt[ , c(ymax = max(y), lapply(.SD, sum)), by = grp]

# ...both give the same result:

Thus, in these examples, explicitly converting the non-list part of j to a list seems redundant. So is it really needed? Perhaps the next example with .N provides a clue.
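As a side check, the equivalence of the two forms for a named, length-one value can also be confirmed in plain R, independent of data.table. A minimal sketch with made-up values; list() is used here because .() is an alias that is only available inside a data.table call:

```r
# Base R only: a named scalar combined with a list, with and without list()
sums <- list(x = 3L, y = 5L)         # stand-in for lapply(.SD, sum)

with_list    <- c(list(ymax = 3L), sums)
without_list <- c(ymax = 3L, sums)

identical(with_list, without_list)   # TRUE
names(without_list)                  # "ymax" "x" "y"
```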


Explicit list needed with .N!

In this case the 'non-list' part of j is .N; otherwise j is the same as above.

The structure of j is the same, both with and without explicit conversion of .N to a list in j:

dt[ , str(c(.(n = .N), lapply(.SD, sum))), by = grp]
dt[ , str(c(n = .N, lapply(.SD, sum))), by = grp]

# List of 3
#  $ n: int 2
#  $ x: int 3
#  $ y: int 5
# List of 3
#  $ n: int 3
#  $ x: int 12
#  $ y: int 15

Note that according to str, in both cases the .N element has the name which was set in j, "n".

However, if .N is not explicitly "listed" in j, the name of .N remains the default "N" (see ?.N) in the results:

dt[ , c(.(n = .N), lapply(.SD, sum)), by = grp]
#    grp n  x  y
# 1:   a 2  3  5
# 2:   b 3 12 15

dt[ , c(n = .N, lapply(.SD, sum)), by = grp]
#    grp N  x  y
# 1:   a 2  3  5
# 2:   b 3 12 15
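To see how an installed version behaves, the column names of the two results can be compared directly. A minimal sketch that rebuilds dt from above; the object names res_list and res_bare are introduced here purely for illustration, and the outcome of the final comparison may well depend on the data.table version:

```r
library(data.table)

dt <- data.table(grp = rep(c("a", "b"), 2:3), x = 1:5, y = 2:6)

# .N wrapped in an explicit list vs. passed bare to c()
res_list <- dt[ , c(.(n = .N), lapply(.SD, sum)), by = grp]
res_bare <- dt[ , c(n = .N, lapply(.SD, sum)), by = grp]

names(res_list)  # "grp" "n" "x" "y", as in the output shown above
names(res_bare)  # compare the second name: "n" or the default "N"?

# TRUE only if the requested name survives without the explicit list
identical(names(res_list), names(res_bare))
```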

Question

Of course my question is not about trying to avoid typing the three characters .(), but to understand the fundamentals of how j can/should be specified.

Is the default name of .N (as shown above) the exception which warrants the use of explicit list, always, just to be on the safe side? Are there other pitfalls which I have overlooked?

Henrik asked Jan 18 '26 13:01


1 Answer

First, thanks to @Frank for pointing out that the behaviour of the name of .N might be a bug. Although it is a minor issue in itself (I reported it anyway), it does seem to have been the cause of my more general confusion about the use of list in j.

I made some further simple tests on .N and its name, which suggest some potential inconsistencies. FWIW, I thought I might just as well share them here (too long for a comment).

1: The name of .N in result

1a: .N is autonamed 'N' when (1) using .N only, (2) .N with by, or (3) .N with lapply and by.

1b: .N is autonamed 'V1', instead of 'N' when using (1) .N with lapply, or (2) list(.N) with lapply

# 1a: .N is autonamed 'N'

# .N only
dt[ , .(.N)]
#    N
# 1: 5

# .N + by
dt[ , .(.N), by = grp]
#    grp N
# 1:  a 2
# 2:  b 3

# .N + lapply + by 
dt[ , c(.N, lapply(.SD, max)), by = grp]
#    grp N x y
# 1:   a 2 2 3 
# 2:   b 3 5 6


# 1b: .N is autonamed V1, instead of N

# .N + lapply
dt[ , c(.N, lapply(.SD, max))]
#    V1 grp x y
# 1:  5   b 5 6

# list(.N) + lapply
dt[ , c(.(.N), lapply(.SD, max))]
#    V1 grp x y
# 1:  5   b 5 6

2: Renaming .N

2a: list(.N) is not needed when using .N with lapply

2b: list(.N) is needed when using .N with lapply and by.

# 2a: list(.N) not needed

# .N + lapply
dt[ , c(n = .N, lapply(.SD, max))]
#    n grp x y
# 1: 5   b 5 6


# 2b: list(.N) needed

# .N + lapply + by
dt[ , c(.(n = .N), lapply(.SD, max)), by = grp]
#    grp n x y
# 1:   a 2 2 3
# 2:   b 3 5 6
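A related point that goes beyond .N, sketched here in base R only and not tested inside j: once the 'non-list' part has more than one element, c() splices it into separate list elements, whereas an explicit list keeps it as a single element. The values and the names rng, spliced and wrapped are made up for illustration:

```r
# Base R only: c() splices a longer vector, list() keeps it whole
rng  <- c(2L, 6L)    # e.g. what range(y) might return
sums <- list(x = 3L)

spliced <- c(r = rng, sums)        # no explicit list
wrapped <- c(list(r = rng), sums)  # explicit list

names(spliced)   # "r1" "r2" "x"
length(spliced)  # 3
names(wrapped)   # "r" "x"
length(wrapped)  # 2
```

So the two forms are only interchangeable when the non-list part has length one.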

Henrik answered Jan 21 '26 06:01


