I am learning data.table properties from a blog post. I am trying to understand the part under "summary table (short and narrow)", starting by coercing data.frame(mtcars) to data.table:
> data <- as.data.table(mtcars)
> data <- data[,.(gear,cyl)]
> head(data)
    gear cyl
 1:    4   6
 2:    4   6
 3:    4   4
 4:    3   6
 5:    3   8
 6:    3   6
Up to this point everything is fine.
Now I have tried this:
> data[, gearsL := list(list(unique(gear))), by=cyl]
> head(data)
   gear cyl gearsL
1:    4   6  4,3,5
2:    4   6  4,3,5
3:    4   4  4,3,5
4:    3   6  4,3,5
5:    3   8    3,5
6:    3   6  4,3,5
I understand unique(gear), but I don't understand what list(list(unique(gear))) is doing.
A data.table -- like any data.frame -- is a list of pointers to column vectors.
When creating new columns, we write j of DT[i,j,by] so that it evaluates to a list of columns:
DT[, (newcol_names) := list(newcol_A, newcol_B)]
That's what the outermost list() in the OP's example does, for a single list column.
data[,gearsL := list(list(unique(gear))), by=cyl]
This can and should be written using the alias .(), for clarity:
data[, gearsL := .(list(unique(gear))), by=cyl]
That's all you need to know, but I've put some elaboration below.
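To make the "list of columns" idea concrete, here is a minimal sketch that adds two columns in one call. The toy table and the names A, B and newcol_names are made up for illustration; they are not from the question.

```r
library(data.table)

# Hypothetical toy table; names below are illustrative, not from the question
DT <- data.table(id = 1:3)
newcol_names <- c("A", "B")

# The right-hand side of := is a list with one element per new column
DT[, (newcol_names) := .(id * 2L, letters[1:3])]

print(DT)
```

Each element of the outer .() becomes one column, which is why a single list column still needs that outer wrapper around its own list().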
Details. When creating a new column, we can often skip list()/.():
DT = data.table(id=1:3)
DT[, E := c(4,5,6)]
DT[, R := 3]
# this works as if we had typed
# R := c(3,3,3)
Note that E enumerates each value, while R recycles a single value over all rows. Next example:
DT[, Elist := list(hist(rpois(1,1)), hist(rpois(2,2)), hist(rpois(3,3)))]
As we did for E, we're enumerating the values of Elist here. This still uses the shortcut; list() is here only because the column is itself a list, as confirmed by
sapply(DT, class)
#        id         E         R     Elist 
# "integer" "numeric" "numeric"    "list" 
The convenient shortcut of skipping list()/.() fails in one special case: when we are creating a list column that recycles its value:
DT[, Rlist := list(c("a","b"))]
# based on the pattern for column R, this should work as if we typed 
# Rlist := list(c("a","b"), c("a","b"), c("a","b"))
It doesn't work because data.table reads this as Rlist := .( c("a", "b") ), that is, a list of columns containing one ordinary column with two values, and thinks we simply neglected to make a full enumeration with one value for each row, like Elist does. To get the desired result, skip the shortcut and wrap the vector in list()/.():
DT[, Rlist := .(list(c("a","b")))]
#    id E R       Elist Rlist
# 1:  1 4 3 <histogram>   a,b
# 2:  2 5 3 <histogram>   a,b
# 3:  3 6 3 <histogram>   a,b
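A small sketch comparing the two forms side by side. I'm assuming the shortcut attempt raises either a warning or an error depending on the data.table version, so both are caught; the attempt runs on a copy (tmp, a name I made up) so the real table is untouched.

```r
library(data.table)

DT <- data.table(id = 1:3)

# Shortcut form, tried on a copy: the outer list() is taken as the list of
# columns, so the column's values would be c("a","b"), 2 values for 3 rows.
tmp <- copy(DT)
res <- tryCatch(
  { tmp[, Rlist := list(c("a", "b"))]; "no condition raised" },
  warning = function(w) paste("warning:", conditionMessage(w)),
  error   = function(e) paste("error:", conditionMessage(e))
)
print(res)

# Explicit form: outer .() is the list of columns, inner list() is the cell value
DT[, Rlist := .(list(c("a", "b")))]

# Every row's cell holds the full two-element vector
lengths(DT$Rlist)
```

The inner list() has length one, so it is recycled across rows the same way the 3 was for column R.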
This is the case in the OP's example, where the outer list()/.() is necessary.
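To check this on the OP's data, here is a short sketch that rebuilds the table and inspects the result. The helper column name n_gears is my own, added only for inspection.

```r
library(data.table)

data <- as.data.table(mtcars)[, .(gear, cyl)]

# Outer .() = list of columns; inner list() = one list cell per cyl group,
# recycled over every row of that group
data[, gearsL := .(list(unique(gear))), by = cyl]

# gearsL is a list column
class(data$gearsL)

# The cell for any cyl == 8 row holds that group's unique gears
data[cyl == 8, gearsL[[1]]]

# One row per group: how many distinct gears each cyl has
unique(data[, .(cyl, n_gears = lengths(gearsL))])
```

The values for cyl == 8 should match the 3,5 shown in the question's output.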