Well this one's a bit strange... Seems that by creating a new column in a data.table by using the := operator, a previously assigned variable (created using colnames) changes silently.
Is this expected behaviour? If not what's at fault?
# Let's make a simple data table
require(data.table)
dt <- data.table(fruit=c("apple","banana","cherry"),quantity=c(5,8,23))
dt
    fruit quantity
1:  apple        5
2: banana        8
3: cherry       23
# and assign the column names to a variable
colsdt <- colnames(dt)
str(colsdt)
chr [1:2] "fruit" "quantity"
# Now let's add a column to the data table using the := operator
dt[,double_quantity:=quantity*2]
dt
    fruit quantity double_quantity
1:  apple        5              10
2: banana        8              16
3: cherry       23              46
# ... and WITHOUT explicitly changing 'colsdt', let's take another look:
str(colsdt)
chr [1:3] "fruit" "quantity" "double_quantity"
# ... colsdt has been silently updated!
For comparison's sake, I thought I'd see if adding a new column via the data.frame method has the same issue. It doesn't:
dt$triple_quantity=dt$quantity*3
dt
    fruit quantity double_quantity triple_quantity
1:  apple        5              10              15
2: banana        8              16              24
3: cherry       23              46              69
# ... again I make no explicit changes to colsdt, so let's take a look:
str(colsdt)
chr [1:3] "fruit" "quantity" "double_quantity"
# ... and this time it is NOT silently updated
So is this a bug with the data.table := operator, or expected behaviour?
Thanks!
Short Answer, use copy
colsdt <- copy(colnames(dt))
Then you are all good.
dt[,double_quantity:=quantity*2]
str(colsdt)
# chr [1:2] "fruit" "quantity"
What's going on is that in general (i.e., in base R), the assignment operator <- creates a new copy of the object when assigning a value to an object (strictly, the copy is made once the value is modified, as described further down). This is true even when assigning to the same object name, as in x <- x + 1, or, a lot more costly, DF$newCol <- DF$a + DF$b. With large objects (think 100K+ rows and dozens or hundreds of columns; the more columns, the worse), this can get very costly.
data.table, through pure wizardry (read: C code), avoids this overhead. Instead, it sets a pointer to the same memory location where the object's value is already stored. This is what offers the huge efficiency and speed boost.
But it also means that objects that might otherwise appear to be completely different and independent are in fact one and the same.
This is where copy comes in. It creates a new copy of an object, as opposed to passing by reference.
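The same distinction applies to a whole data.table, not just its column names. The following is a minimal sketch, assuming the data.table package is installed; the names alias and snapshot are made up for illustration.

```r
library(data.table)

dt <- data.table(fruit = c("apple", "banana", "cherry"), quantity = c(5, 8, 23))

alias    <- dt        # plain assignment: a second name for the same data.table
snapshot <- copy(dt)  # copy(): an independent duplicate

dt[, double_quantity := quantity * 2]  # add a column by reference

names(alias)     # includes "double_quantity", because alias and dt are one object
names(snapshot)  # still only "fruit" and "quantity"
```

Only snapshot is insulated from later := calls on dt.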
Note: I am using the terms "source" and "destination" very loosely, where they refer to the assignment relationship destination <- source.
This is in fact expected behavior, admittedly a bit obfuscated.
In base R, when you assign via <-, the two objects point to the same memory location until one of them changes.
This way of handling memory has many benefits, namely that so long as the two objects have the same exact value, there is no need to duplicate memory. The duplication is held off as long as possible.
a <- 1:5
b <- a
.Internal(inspect(a)) # @11a5e2a88 13 INTSXP g0c3 [NAM(2)] (len=5, tl=0) 1,2,3,4,5
.Internal(inspect(b)) # @11a5e2a88 13 INTSXP g0c3 [NAM(2)] (len=5, tl=0) 1,2,3,4,5
#                           ^^^^ Notice the same memory location
Once either of the two objects changes, that "bond" is broken. That is, changing either the "source" or "destination" object will cause that object to be reassigned to a new memory location.
a[[3]] <- a[[3]] + 1
.Internal(inspect(a)) # @11004bc38 14 REALSXP g0c4 [NAM(1)] (len=5, tl=0) 1,2,4,4,5
#                           ^^^^ New location
.Internal(inspect(b)) # @11a5e2a88 13 INTSXP g0c3 [NAM(2)] (len=5, tl=0) 1,2,3,4,5
#                           ^^^^ Still the same as it was before;
#                                note the actual value. This is where `a` _had_ been
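The addresses printed by .Internal(inspect()) differ between sessions, so the same point can also be checked on values alone. A minimal base R sketch:

```r
a <- 1:5
b <- a        # no data is duplicated yet; both names refer to the same vector
a[3] <- 10L   # modifying `a` triggers the copy, so `b` is left alone

a  # 1 2 10 4 5
b  # 1 2 3 4 5
```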
The problem in data.table's case is that we rarely reassign the actual data.table object.
Notice that if we modify the "destination" object, then it gets moved (copied) off of that memory location.
colsdt <- colnames(dt)
.Internal(inspect(colnames(dt))) # @114859280 16 STRSXP g0c7 [MARK,NAM(2)] (len=2, tl=100)
.Internal(inspect(colsdt)) # @114859280 16 STRSXP g0c7 [MARK,NAM(2)] (len=2, tl=100)
#                                  ^^^^ Notice the same memory location
# insignificant change
colsdt[] <- colsdt
.Internal(inspect(colsdt)) # @100aa4a40 16 STRSXP g0c2 [NAM(1)] (len=2, tl=100)
# we can test the original issue from the OP:
dt[, newCol := quantity*2]
str(colnames(dt)) # chr [1:3] "fruit" "quantity" "newCol"
str(colsdt) # chr [1:2] "fruit" "quantity"
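However, since we are (almost) always modifying by reference when working with data.table, this can cause unexpected results. Namely, the situation where we assign something taken from a data.table to a new name with the <- assignment operator, then modify the data.table by reference, while expecting the assigned object to keep its earlier value. This of course will cause an issue.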
data.table is an amazingly powerful package. The source of its strength is the fact that it avoids making copies whenever possible.
This shifts the onus to the user to be deliberate and judicious about copying, and about when to expect a copy to be made.
In other words, the best practice is: when you expect a copy to exist, use the copy function.
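One place this matters in practice is working out which columns a step added. The following is a minimal sketch, assuming the data.table package is installed; cols_before and new_cols are illustrative names.

```r
library(data.table)

dt <- data.table(fruit = c("apple", "banana", "cherry"), quantity = c(5, 8, 23))

# Snapshot the column names with copy() so later := calls cannot touch them
cols_before <- copy(names(dt))

dt[, double_quantity := quantity * 2]

# Columns added since the snapshot
new_cols <- setdiff(names(dt), cols_before)
new_cols  # "double_quantity"
```

Without copy(), cols_before could be updated along with dt, as in the question, and the setdiff could come back empty.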