Why is data.table casting column classes when I assign all columns by reference

Tags: r, data.table

Here is something I do not understand about data.table. If I select a single row and try to set all of its values to NA, every column of the resulting one-row data.table is coerced to logical.

library(data.table)

#Here is a sample table
DT <- data.table(a=rep(1L,3),b=rep(1.1,3),d=rep('aa',3))
DT
#    a   b  d
# 1: 1 1.1 aa
# 2: 1 1.1 aa
# 3: 1 1.1 aa

#Here I extract a line, all the column types are kept... good
str(DT[1])
# Classes ‘data.table’ and 'data.frame':  1 obs. of  3 variables:
#  $ a: int 1
#  $ b: num 1.1
#  $ d: chr "aa"
#  - attr(*, ".internal.selfref")=<externalptr> 

#Now here I want to set them all to `NA`...they all become logicals => WHY IS THAT ?
str(DT[1][,colnames(DT) := NA])
# Classes ‘data.table’ and 'data.frame':  1 obs. of  3 variables:
#  $ a: logi NA
#  $ b: logi NA
#  $ d: logi NA
#  - attr(*, ".internal.selfref")=<externalptr> 

EDIT: I think it is a bug, because the column type changes when the table has one row but not when it has two:

str(DT[1][ , a := NA])
# Classes ‘data.table’ and 'data.frame':  1 obs. of  3 variables:
#  $ a: logi NA
#  $ b: num 1.1
#  $ d: chr "aa"
#  - attr(*, ".internal.selfref")=<externalptr> 

str(DT[1:2][ , a := NA])
# Classes ‘data.table’ and 'data.frame':  2 obs. of  3 variables:
#  $ a: int  NA NA
#  $ b: num  1.1 1.1
#  $ d: chr  "aa" "aa"
#  - attr(*, ".internal.selfref")=<externalptr> 
Asked by statquant, Sep 03 '13 13:09


1 Answer

To provide an answer, from ?":=" :

Unlike <- for data.frame, the (potentially large) LHS is not coerced to match the type of the (often small) RHS. Instead the RHS is coerced to match the type of the LHS, if necessary. Where this involves double precision values being coerced to an integer column, a warning is given (whether or not fractional data is truncated). The motivation for this is efficiency. It is best to get the column types correct up front and stick to them. Changing a column type is possible but deliberately harder: provide a whole column as the RHS. This RHS is then plonked into that column slot and we call this plonk syntax, or replace column syntax if you prefer. By needing to construct a full length vector of a new type, you as the user are more aware of what is happening, and it's clearer to readers of your code that you really do intend to change the column type.

The motivation for all this is large tables (say 10GB in RAM), of course. Not 1 or 2 row tables.

To put it more simply: if length(RHS) == nrow(DT) then the RHS (and whatever its type) is plonked into that column slot. Even if those lengths are 1. If length(RHS) < nrow(DT), the memory for the column (and its type) is kept in place, but the RHS is coerced and recycled to replace the (subset of) items in that column.
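
As a minimal sketch of that rule, the example below applies both cases to the three-row table from the question. It assumes the data.table package is installed and that its behaviour matches the description above; the class() calls are only there to inspect the result.

```r
library(data.table)

# Same sample table as in the question
DT <- data.table(a = rep(1L, 3), b = rep(1.1, 3), d = rep("aa", 3))

# length(RHS) < nrow(DT): the logical NA is coerced to the column's type
# and recycled, so column "a" keeps its integer type.
DT[, a := NA]
class(DT$a)

# length(RHS) == nrow(DT): the full-length vector replaces the column,
# so column "b" takes the (logical) type of the RHS.
DT[, b := rep(NA, nrow(DT))]
class(DT$b)
```

With a one-row table, a length-1 RHS already has length(RHS) == nrow(DT), which is why the question's DT[1] example ends up with logical columns.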

If I need to change a column's type in a large table I write:

DT[, col := as.numeric(col)]

Here as.numeric allocates a new vector and coerces "col" into that new memory, which is then plonked into the column slot. It's as efficient as it can be. That's a plonk because length(RHS) == nrow(DT).

If you want to overwrite a column with a different type containing some default value:

DT[, col := rep(21.5, nrow(DT))]    # i.e., deliberately harder

If "col" was type integer before, then it'll change to type numeric containing 21.5 for every row. Otherwise, just DT[, col := 21.5] would result in a warning about 21.5 being coerced to 21 (unless DT has only 1 row!).
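
Applied to the original question, one way to blank out a one-row table while keeping its column types is to make every RHS value an NA of the column's own type. This is a sketch rather than an official recipe: the names one_row and cols are introduced here for illustration, and it relies on base R's rule that indexing a vector with NA_integer_ returns a length-1 NA of the same type.

```r
library(data.table)

# Same sample table as in the question
DT <- data.table(a = rep(1L, 3), b = rep(1.1, 3), d = rep("aa", 3))

one_row <- DT[1]          # hypothetical name: a one-row copy of DT
cols <- names(one_row)    # hypothetical name: all column names

# x[NA_integer_] yields NA_integer_, NA_real_ or NA_character_ depending
# on the type of x, so each replacement value already has the right type.
one_row[, (cols) := lapply(.SD, function(x) x[NA_integer_]), .SDcols = cols]
str(one_row)
```

Because each RHS value already has the column's type, the result does not depend on whether the assignment is a plonk or a coerce-and-recycle.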

Answered by Matt Dowle, Sep 30 '22 14:09