I am learning data.table properties from a blog post. I am trying to understand the part under "summary table (short and narrow)", starting by coercing data.frame(mtcars) to data.table:
> data <- as.data.table(mtcars)
> data <- data[,.(gear,cyl)]
> head(data)
gear cyl
1: 4 6
2: 4 6
3: 4 4
4: 3 6
5: 3 8
6: 3 6
Up to this point everything is fine.
Now I have tried this:
> data[, gearsL := list(list(unique(gear))), by=cyl]
> head(data)
gear cyl gearsL
1: 4 6 4,3,5
2: 4 6 4,3,5
3: 4 4 4,3,5
4: 3 6 4,3,5
5: 3 8 3,5
6: 3 6 4,3,5
I understand unique(gear), but I don't understand what list(list(unique(gear))) is doing.
A data.table -- like any data.frame -- is a list of pointers to column vectors.
When creating new columns, we write j of DT[i,j,by] so that it evaluates to a list of columns:
DT[, (newcol_names) := list(newcol_A, newcol_B)]
That's what the outermost list() in the OP's example does, for a single list column.
data[,gearsL := list(list(unique(gear))), by=cyl]
This can and should be written using the alias .(), for clarity:
data[, gearsL := .(list(unique(gear))), by=cyl]
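To see what ends up in the new column, here is a minimal sketch that rebuilds the OP's table and inspects gearsL. It assumes the data.table package is installed and uses the built-in mtcars data; the row positions checked are the ones shown in the OP's head() output.

```r
library(data.table)

# Rebuild the OP's two-column table from the built-in mtcars data
data <- as.data.table(mtcars)[, .(gear, cyl)]

# Outer .() = "list of columns" (one column here);
# inner list() = the single value of that column for each cyl group
data[, gearsL := .(list(unique(gear))), by = cyl]

# gearsL is a list column: every row stores the whole vector of
# unique gears for that row's cyl group
class(data$gearsL)
data$gearsL[[1]]   # row 1 has cyl == 6
data$gearsL[[5]]   # row 5 has cyl == 8
```

Each cell of gearsL holds a full vector rather than a single number, which is why head(data) prints values such as 4,3,5.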
That's all you need to know, but I've put some elaboration below.
Details. When creating a new column, we can often skip list()/.():
DT = data.table(id=1:3)
DT[, E := c(4,5,6)]
DT[, R := 3]
# this works as if we had typed
# R := c(3,3,3)
Note that E enumerates each value, while R recycles a single value over all rows. Next example:
DT[, Elist := list(hist(rpois(1,1)), hist(rpois(2,2)), hist(rpois(3,3)))]
As we did for E, we're enumerating the values of Elist here. This still uses the shortcut; list() is here only because the column is itself a list, as confirmed by
sapply(DT, class)
# id E R Elist
# "integer" "numeric" "numeric" "list"
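For a variant without plotting side effects, here is a minimal sketch of an enumerated list column. The integer vectors are stand-in values chosen for illustration, not the histograms used above; the point is only that a list with one element per row becomes a list column through the shortcut.

```r
library(data.table)

DT <- data.table(id = 1:3)

# One list element per row, so the shortcut applies:
# this single list() is the column itself, not a list of columns
DT[, Elist := list(1L, 1:2, 1:3)]

class(DT$Elist)     # the column type
lengths(DT$Elist)   # each row holds a vector of a different length
```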
The convenient shortcut of skipping list()/.() fails in one special case: when we are creating a list column that recycles its value:
DT[, Rlist := list(c("a","b"))]
# based on the pattern for column R, this should work as if we typed
# Rlist := list(c("a","b"), c("a","b"), c("a","b"))
It doesn't work because data.table reads this as Rlist := .( c("a", "b") ), that is, a list of columns containing one length-2 column, and thinks we simply neglected to make a full enumeration with one value for each row, like Elist does. To get the desired result, skip the shortcut and wrap the list in an outer list()/.():
DT[, Rlist := .(list(c("a","b")))]
# id E R Elist Rlist
# 1: 1 4 3 <histogram> a,b
# 2: 2 5 3 <histogram> a,b
# 3: 3 6 3 <histogram> a,b
This is the case in the OP's example, where the outer list()/.() is necessary.
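The two forms can be compared side by side in a minimal sketch. One assumption to flag: how the shortcut form fails depends on the data.table version (recent releases raise an error about the number of items supplied, while older ones may recycle with a warning), so the code below just records which condition was signalled instead of relying on a specific message. DT2 and outcome are illustrative names.

```r
library(data.table)

DT <- data.table(id = 1:3)

# Shortcut form: the single list() is taken as the list of columns, so
# c("a","b") is treated as one length-2 column for a 3-row table
outcome <- tryCatch(
  {
    DT[, Rlist := list(c("a", "b"))]
    "no condition"
  },
  warning = function(w) "warning",
  error   = function(e) "error"
)
outcome

# Explicit form: the outer list() is the list of columns, the inner list()
# is a length-1 list column value that is recycled over all rows
DT2 <- data.table(id = 1:3)
DT2[, Rlist := list(list(c("a", "b")))]

class(DT2$Rlist)
DT2$Rlist[[3]]
```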