
Set a variable using colnames(), update data.table using := operator, variable is silently updated? [duplicate]

Tags: r, data.table

Well, this one's a bit strange. It seems that creating a new column in a data.table with the := operator silently changes a previously assigned variable (created using colnames).

Is this expected behaviour? If not, what's at fault?

# Let's make a simple data table
library(data.table)
dt <- data.table(fruit=c("apple","banana","cherry"),quantity=c(5,8,23))
dt
    fruit quantity
1:  apple        5
2: banana        8
3: cherry       23

# and assign the column names to a variable
colsdt <- colnames(dt)
str(colsdt)
 chr [1:2] "fruit" "quantity"

# Now let's add a column to the data table using the := operator
dt[,double_quantity:=quantity*2]
dt
    fruit quantity double_quantity
1:  apple        5              10
2: banana        8              16
3: cherry       23              46

# ... and WITHOUT explicitly changing 'colsdt', let's take another look:
str(colsdt)
 chr [1:3] "fruit" "quantity" "double_quantity"

# ... colsdt has been silently updated!

For comparison's sake, I thought I'd see if adding a new column via the data.frame method has the same issue. It doesn't:

dt$triple_quantity=dt$quantity*3
dt
    fruit quantity double_quantity triple_quantity
1:  apple        5              10              15
2: banana        8              16              24
3: cherry       23              46              69

# ... again I make no explicit changes to colsdt, so let's take a look:
str(colsdt)
 chr [1:3] "fruit" "quantity" "double_quantity"

# ... and this time it is NOT silently updated

So is this a bug with the data.table := operator, or expected behaviour?

Thanks!

jonsedar asked May 14 '13 12:05


1 Answer

Short answer: use copy.

colsdt <- copy(colnames(dt))

Then you are all good.

dt[,double_quantity:=quantity*2]
str(colsdt)
# chr [1:2] "fruit" "quantity"

What's going on is that in general (i.e., in base R), the assignment operator <- behaves as though it creates a new copy of the object when assigning a value to an object. This is true even when assigning to the same object name, as in x <- x + 1 or, a lot more costly, DF$newCol <- DF$a + DF$b. With large objects (think 100K+ rows and dozens or hundreds of columns; the more columns, the worse), this can get very expensive.

data.table, through pure wizardry (read: C code), avoids this overhead. Instead, it sets a pointer to the same memory location where the object's value is already stored. This is what offers the huge efficiency and speed boost.

But it also means that objects that might otherwise appear to be completely different and independent are often, in fact, one and the same.
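One rough way to see the "no copy" behaviour is to compare the table's memory address before and after a := call. This is only a sketch: it uses data.table's address() helper on the table from the question, and it assumes a freshly created data.table (one that has been loaded from disk or has run out of pre-allocated column slots may behave differently).

```r
library(data.table)

dt <- data.table(fruit = c("apple", "banana", "cherry"), quantity = c(5, 8, 23))

addr_before <- address(dt)             # memory address of the data.table itself
dt[, double_quantity := quantity * 2]  # add a column by reference
addr_after <- address(dt)

# If := really modified dt in place, the two addresses should match
identical(addr_before, addr_after)
```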

This is where copy comes in. It creates a new copy of an object, as opposed to passing by reference.
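The same idea applies to a whole data.table, not just to its column names. A minimal sketch, where the names alias and snapshot are purely illustrative:

```r
library(data.table)

dt <- data.table(fruit = c("apple", "banana", "cherry"), quantity = c(5, 8, 23))

alias    <- dt        # plain assignment: both names refer to the same table
snapshot <- copy(dt)  # explicit copy: independent of dt from here on

dt[, double_quantity := quantity * 2]

names(alias)     # follows dt, so it now includes "double_quantity"
names(snapshot)  # still just "fruit" and "quantity"
```

Here alias picks up the new column because it was never a separate object, while snapshot keeps the state dt had when copy() was called.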


Some more detail as to why this is happening:

Note: I am using the terms "source" and "destination" very loosely; they refer to the assignment relationship destination <- source.

This is in fact expected behaviour, admittedly a bit obfuscated.

In base R, when you assign via <-, the two objects point to the same memory location until one of them changes. This way of handling memory has many benefits, namely that so long as the two objects have the same exact value, there is no need to duplicate memory. The copy is held off as long as possible.

a <- 1:5
b <- a
.Internal(inspect(a))  # @11a5e2a88 13 INTSXP g0c3 [NAM(2)] (len=5, tl=0) 1,2,3,4,5
.Internal(inspect(b))  # @11a5e2a88 13 INTSXP g0c3 [NAM(2)] (len=5, tl=0) 1,2,3,4,5
#                           ^^^^  Notice the same memory location

Once either of the two objects changes, that "bond" is broken. That is, changing either the "source" or the "destination" object will cause that object to be reassigned to a new memory location.

a[[3]] <- a[[3]] + 1
.Internal(inspect(a))  # @11004bc38 14 REALSXP g0c4 [NAM(1)] (len=5, tl=0) 1,2,4,4,5
#                            ^^^^ New location
.Internal(inspect(b))  # @11a5e2a88 13 INTSXP g0c3 [NAM(2)] (len=5, tl=0) 1,2,3,4,5
#                         ^^^^^ Still same as it was before;
#                               note the actual value. This is where `a` _had_ been
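If you would rather not rely on .Internal(), base R's tracemem() can report the same copy-on-modify event. This is a sketch only: tracemem() needs an R build with memory profiling enabled, so the calls are guarded with capabilities("profmem"); the can_trace variable is just an illustrative name.

```r
a <- 1:5
b <- a                  # no copy yet: a and b share the same memory

can_trace <- capabilities("profmem")
if (can_trace) tracemem(a)     # report whenever this vector gets duplicated

a[[3]] <- a[[3]] + 1    # modifying a forces the copy (tracemem prints a message)

if (can_trace) untracemem(a)

a   # 1 2 4 4 5
b   # 1 2 3 4 5, untouched
```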

The problem in data.table's case is that we rarely reassign the actual data.table object; := changes it in place, so the copy described above is never triggered on the "source" side. Notice that if we modify the "destination" object, then it gets moved (copied) off of that memory location.

colsdt <- colnames(dt)
.Internal(inspect(colnames(dt)))  # @114859280 16 STRSXP g0c7 [MARK,NAM(2)] (len=2, tl=100)
.Internal(inspect(colsdt))        # @114859280 16 STRSXP g0c7 [MARK,NAM(2)] (len=2, tl=100)
#                                     ^^^^  Notice the same memory location
# insignificant change
colsdt[] <- colsdt
.Internal(inspect(colsdt))       # @100aa4a40 16 STRSXP g0c2 [NAM(1)] (len=2, tl=100)

# we can test the original issue from the OP:
dt[, newCol := quantity*2]
str(colnames(dt))   #  chr [1:3] "fruit" "quantity" "newCol"
str(colsdt)         #  chr [1:2] "fruit" "quantity"

The situation to avoid:

However, since we are (almost) always modifying by reference when working with data.table, this can cause unexpected results. Namely, the situation where:

  • we assign from a data.table object using the standard <- assignment operator
  • then subsequently we change the value of the "source" data.table
  • we expect (and our code might depend on) the "destination" object to still have the value previously assigned to it.

This of course will cause an issue.
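The three steps above can be sketched as follows, using the table from the question. The variable names risky_cols and safe_cols are made up for illustration, and the `..` prefix for column selection assumes a reasonably recent data.table version.

```r
library(data.table)

dt <- data.table(fruit = c("apple", "banana", "cherry"), quantity = c(5, 8, 23))

# Step 1: assign from the data.table
risky_cols <- names(dt)        # plain <- assignment
safe_cols  <- copy(names(dt))  # same thing, but with an explicit copy

# Step 2: the "source" data.table changes by reference
dt[, double_quantity := quantity * 2]

# Step 3: later code relies on the saved names
dt[, ..safe_cols]     # selects only the original two columns
length(safe_cols)     # 2
length(risky_cols)    # may no longer be 2, as seen in the question
```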

data.table is an amazingly powerful package. The source of its strength is the fact that it avoids making copies whenever possible.

Best Practice:

This shifts the onus to the user to be deliberate and judicious about copying, and about when to expect a copy to have been made.

In other words, the best practice is: when you expect a copy to exist, use the copy function.

Ricardo Saporta answered Oct 09 '22 16:10