 

data.table modifies parent environment / weird behavior with setDT

Tags: r, data.table

If I build a data.table by calling setDT on a data.frame made from existing vectors, the original vector gets modified in the parent environment:

a <- 1:2 / 2
x <- 1:10 / 2
y <- 11/2
dt <- data.frame(a, x, y)
setDT(dt)
dt[ , cond := a == 1]
dt[(cond), c("x", "y") := list(y, x)]
x
#[1] 0.5 5.5 1.5 5.5 2.5 5.5 3.5 5.5 4.5 5.5

For reference, I am using R 3.5.1 and data.table 1.11.4.

If I use the data.table() constructor instead of data.frame() + setDT(), the vector x is not modified:

a <- 1:2 / 2
x <- 1:10 / 2
y <- 11/2
dt <- data.table(a, x, y)
dt[ , cond := a == 1]
dt[(cond), c("x", "y") := list(y, x)]
x
#[1] 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0

Can somebody explain what is happening here, and whether it is a bug?

Cheers

EDIT1: I just found this related issue on GitHub: https://github.com/Rdatatable/data.table/issues/2683

EDIT2: The culprit is evidently sharing by reference: the vectors x and dt$x have the same memory address, so modifying the column in place also modifies the vector outside the data.table. I would have expected the data.frame creation to make a copy...

> a <- 1:2 / 2
> x <- 1:10 / 2
> y <- 11/2
> dt <- setDT(as.data.frame(list(a = a, x = x, y = y)))
> dt[ , cond := a == 1]
> dt[(cond), c("x", "y") := list(y, x)]
> x
[1] 0.5 5.5 1.5 5.5 2.5 5.5 3.5 5.5 4.5 5.5
> address(dt$x)
[1] "0xadd8fe8"
> address(x)
[1] "0xadd8fe8"
asked Sep 04 '18 16:09 by BenoitLondon


1 Answer

setDT() modifies its input object by reference. If that input was itself built from shallow copies (as opposed to deep copies) of other objects, those objects share memory with its columns, so they are modified too when you use := or set() from data.table.

data.frame() appears to make shallow copies of its input objects wherever possible, for efficiency, so address(df$x) and address(x) are identical. That is harmless in base R, which copies on modify, but := and set() update columns in place and bypass that protection.
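
A minimal sketch of the sharing described above, using only data.frame() and data.table's address(). Whether the two addresses actually match may depend on your R version, so no result is claimed for that comparison; the copy-on-modify part is ordinary base R behaviour.

```r
library(data.table)  # loaded here only for address()

x <- 1:10 / 2
df <- data.frame(x)

# Compare the memory address of the column with that of the original vector.
# The question reports these as identical on R 3.5.1.
address(df$x) == address(x)

# A base R assignment goes through copy-on-modify: the column is copied
# before being changed, so the original vector x is left untouched.
df$x[2] <- 99
x[2]
```

Only in-place updates such as := and set() skip that copy, which is why the problem shows up after setDT() and not with plain data.frame code.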

You can avoid such scenarios by creating data.tables directly with data.table(). If instead you are handed a data.frame and do not know how it was created, it is safer to call copy() on it before setDT(). HTH.
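
As a sketch of the copy() suggestion applied to the example from the question (the variable names are those of the question; wrapping the data.frame in copy() before setDT() is one possible way to apply the advice, not the only one):

```r
library(data.table)

a <- 1:2 / 2
x <- 1:10 / 2
y <- 11 / 2

df <- data.frame(a, x, y)

# copy() makes a deep copy of the data.frame, so the columns of the
# resulting data.table no longer share memory with x.
dt <- setDT(copy(df))

dt[, cond := a == 1]
dt[(cond), c("x", "y") := list(y, x)]

x  # the original vector, not touched by the := above
```

The deep copy costs extra memory, so it is worth doing mainly when you cannot rule out that the data.frame shares its columns with other objects.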

answered Oct 02 '22 22:10 by Arun