Here is something I do not understand with data.table.
If I select a row and try to set all of that row's values to NA,
every column of the new one-row data.table is coerced to logical.
# Here is a sample table
library(data.table)
DT <- data.table(a = rep(1L, 3), b = rep(1.1, 3), d = rep('aa', 3))
DT
# a b d
# 1: 1 1.1 aa
# 2: 1 1.1 aa
# 3: 1 1.1 aa
# Here I extract a row; all the column types are kept... good
str(DT[1])
# Classes ‘data.table’ and 'data.frame': 1 obs. of 3 variables:
# $ a: int 1
# $ b: num 1.1
# $ d: chr "aa"
# - attr(*, ".internal.selfref")=<externalptr>
# Now I want to set them all to NA... they all become logical => WHY IS THAT?
str(DT[1][,colnames(DT) := NA])
# Classes ‘data.table’ and 'data.frame': 1 obs. of 3 variables:
# $ a: logi NA
# $ b: logi NA
# $ d: logi NA
# - attr(*, ".internal.selfref")=<externalptr>
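For reference, here is one way I found to blank out the row while keeping each column's type: a sketch that relies on subscripting with a typed `NA` (`x[NA_integer_]` returns a length-1 `NA` of `x`'s own type), rather than assigning a plain logical `NA`.

```r
library(data.table)
DT <- data.table(a = rep(1L, 3), b = rep(1.1, 3), d = rep("aa", 3))

cols <- names(DT)
# x[NA_integer_] yields an NA of x's own type, so each column keeps its
# type instead of being overwritten by a logical NA
na_row <- DT[1][, (cols) := lapply(.SD, function(x) x[NA_integer_]), .SDcols = cols]
str(na_row)
```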
EDIT: I think it is a bug, as:
str(DT[1][ , a := NA])
# Classes ‘data.table’ and 'data.frame': 1 obs. of 3 variables:
# $ a: logi NA
# $ b: num 1.1
# $ d: chr "aa"
# - attr(*, ".internal.selfref")=<externalptr>
str(DT[1:2][ , a := NA])
# Classes ‘data.table’ and 'data.frame': 2 obs. of 3 variables:
# $ a: int NA NA
# $ b: num 1.1 1.1
# $ d: chr "aa" "aa"
# - attr(*, ".internal.selfref")=<externalptr>
To provide an answer, from ?":=":

Unlike <- for data.frame, the (potentially large) LHS is not coerced to match the type of the (often small) RHS. Instead the RHS is coerced to match the type of the LHS, if necessary. Where this involves double precision values being coerced to an integer column, a warning is given (whether or not fractional data is truncated). The motivation for this is efficiency. It is best to get the column types correct up front and stick to them. Changing a column type is possible but deliberately harder: provide a whole column as the RHS. This RHS is then plonked into that column slot and we call this plonk syntax, or replace column syntax if you prefer. By needing to construct a full length vector of a new type, you as the user are more aware of what is happening, and it's clearer to readers of your code that you really do intend to change the column type.
The motivation for all this is large tables (say 10GB in RAM), of course. Not 1 or 2 row tables.
To put it more simply: if length(RHS) == nrow(DT), then the RHS (whatever its type) is plonked into that column slot, even if those lengths are 1. If length(RHS) < nrow(DT), the memory for the column (and its type) is kept in place, but the RHS is coerced and recycled to replace the (subset of) items in that column.
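That rule is exactly what the question is seeing. A minimal sketch of both branches, using the same sample table as above:

```r
library(data.table)
DT <- data.table(a = rep(1L, 3), b = rep(1.1, 3), d = rep("aa", 3))

# length(RHS) = 1 < nrow = 3: column memory and type kept,
# the logical NA is coerced and recycled as NA_integer_
DT2 <- copy(DT)[, a := NA]
stopifnot(is.integer(DT2$a))

# length(RHS) = 1 == nrow = 1: the logical NA is plonked in,
# so the column becomes logical -- the behaviour in the question
DT3 <- DT[1][, a := NA]
stopifnot(is.logical(DT3$a))
```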
If I need to change a column's type in a large table I write:
DT[, col := as.numeric(col)]
Here as.numeric allocates a new vector and coerces "col" into that new memory, which is then plonked into the column slot. It's as efficient as it can be. The reason that's a plonk is because length(RHS) == nrow(DT).
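A small self-contained illustration of that plonk (the column name `col` is just a placeholder):

```r
library(data.table)
DT <- data.table(col = 1:5)

# as.numeric(col) builds a full-length double vector, so := plonks it
# into the column slot, changing the column type in one step
DT[, col := as.numeric(col)]
stopifnot(is.double(DT$col))
```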
If you want to overwrite a column with a different type containing some default value:
DT[, col := rep(21.5, nrow(DT))] # i.e., deliberately harder
If "col" was type integer before, it will change to type numeric, containing 21.5 in every row. Otherwise, just DT[, col := 21.5] would result in a warning about 21.5 being coerced to 21 (unless DT has only one row!).