
Use of "list" in data.table's j argument

Tags: r, data.table

I am learning data.table properties from a blog post. I am trying to understand the part under "summary table (short and narrow)", starting by coercing data.frame(mtcars) to data.table:

> data <- as.data.table(mtcars)

> data <- data[,.(gear,cyl)]
> head(data)
    gear cyl
 1:    4   6
 2:    4   6
 3:    4   4
 4:    3   6
 5:    3   8
 6:    3   6

Up to this point everything is fine.

Next I ran this:

> data[, gearsL := list(list(unique(gear))), by=cyl]

> head(data)
   gear cyl gearsL
1:    4   6  4,3,5
2:    4   6  4,3,5
3:    4   4  4,3,5
4:    3   6  4,3,5
5:    3   8    3,5
6:    3   6  4,3,5

I understand unique(gear), but I don't understand what list(list(unique(gear))) is doing.

asked Oct 13 '15 21:10 by cryptomanic


1 Answer

A data.table -- like any data.frame -- is a list of pointers to column vectors.
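The "list of columns" claim can be checked directly. The following is a minimal sketch, assuming the data.table package is installed; DT and col are names chosen here for illustration.

```r
library(data.table)

# The same two columns the question works with.
DT <- as.data.table(mtcars)[, .(gear, cyl)]

is.list(DT)    # a data.table is a list ...
length(DT)     # ... with one element per column
names(DT)

# Each element of that list is an ordinary column vector.
col <- DT[["gear"]]
is.numeric(col)
length(col) == nrow(DT)
```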

When creating new columns, we write j of DT[i,j,by] so that it evaluates to a list of columns:

DT[, (newcol_names) := list(newcol_A, newcol_B)]

That's what the outermost list() in the OP's example does, for a single list column.

data[,gearsL := list(list(unique(gear))), by=cyl]

This can and should be written using the alias .(), for clarity:

data[, gearsL := .(list(unique(gear))), by=cyl]
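As a quick sanity check, the two spellings can be compared side by side. This is only a sketch: d1 and d2 are copies introduced here so that neither call affects the other, and the comparison looks at the resulting gearsL column only.

```r
library(data.table)

d1 <- as.data.table(mtcars)[, .(gear, cyl)]
d2 <- copy(d1)

# Spelled with list() on the outside ...
d1[, gearsL := list(list(unique(gear))), by = cyl]
# ... and with the .() alias on the outside.
d2[, gearsL := .(list(unique(gear))), by = cyl]

identical(d1$gearsL, d2$gearsL)  # both spellings build the same column
class(d1$gearsL)                 # the new column is a list column

# Each row holds the whole vector of unique gears for its cyl group.
d1[cyl == 8, gearsL[[1]]]
```

The outer list()/.() says "here is a list of columns", and the inner list() says "this one column is itself a list".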

That's all you need to know, but I've put some elaboration below.


Details. When creating a new column, we can often skip list()/.():

DT = data.table(id=1:3)
DT[, E := c(4,5,6)]
DT[, R := 3]
# this works as if we had typed
# R := c(3,3,3)

Note that E enumerates each value, while R recycles a single value over all rows. Next example:

DT[, Elist := list(hist(rpois(1,1)), hist(rpois(2,2)), hist(rpois(3,3)))]

As we did for E, we're enumerating the values of Elist here. This still uses the shortcut; list() is here only because the column is itself a list, as confirmed by

sapply(DT, class)
#        id         E         R     Elist 
# "integer" "numeric" "numeric"    "list" 

The convenient shortcut of skipping list()/.() fails in one special case: when we are creating a list column that recycles its value:

DT[, Rlist := list(c("a","b"))]
# based on the pattern for column R, this should work as if we typed 
# Rlist := list(c("a","b"), c("a","b"), c("a","b"))

It doesn't work because data.table reads this as Rlist := .(c("a","b")), that is, a list of columns containing one ordinary character column of length two. It then concludes that we simply neglected to supply one value for each row, as E does. To get the desired result, skip the shortcut and wrap the list column in an outer list()/.():

DT[, Rlist := .(list(c("a","b")))]

#    id E R       Elist Rlist
# 1:  1 4 3 <histogram>   a,b
# 2:  2 5 3 <histogram>   a,b
# 3:  3 6 3 <histogram>   a,b

This is the case in the OP's example, where the outer list()/.() is necessary.
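The difference between single and double wrapping can be sketched as follows. The names same_len, bad and res are hypothetical, and the exact reaction to the length mismatch is an assumption: as far as I know recent data.table versions raise an error while older ones may only warn, so the sketch catches both rather than relying on either.

```r
library(data.table)

DT <- data.table(id = 1:3)

# Single wrap with a vector as long as the table: the list() is taken as
# "a list of one column", so the result is a plain character column.
DT[, same_len := list(c("a", "b", "c"))]
class(DT$same_len)

# Single wrap with a length-2 vector: lengths do not match the 3 rows.
# Run on a copy so DT itself is untouched whatever happens.
res <- tryCatch(
  { copy(DT)[, bad := list(c("a", "b"))]; "ok" },
  warning = function(w) "warning",
  error   = function(e) "error"
)
res

# Double wrap: one list column whose single value is recycled to every row.
DT[, Rlist := .(list(c("a", "b")))]
class(DT$Rlist)
DT$Rlist[[2]]
```

The same_len case is arguably the more dangerous one, since it succeeds silently but produces an ordinary column instead of a list column.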

answered Oct 15 '22 09:10 by Frank