Check this toy code:
> x <- data.table(a = 1:2)
> foo <- function(z) { z[, b:=3:4] }
> y <- foo(x)
> x[]
   a b
1: 1 3
2: 2 4
It seems data.table is passed by reference. Is this intentional? Is this documented? I did read through the docs and couldn't find a mention of this behaviour.
I'm not asking about data.table's documented reference semantics (in :=, set* and some others). I'm asking whether a complete data.table object is supposed to be passed by reference as a function argument.
Edit: Following @Oliver's answer, here are some more curious examples.
> dt<- data.table(a=1:2)
> attr(dt, ".internal.selfref")
<pointer: 0x564776a93e88>
> address(dt)
[1] "0x5647bc0f6c50"
>
> ff <- function(x) { x[, b:=3:4]; print(address(x)); print(attr(x, ".internal.selfref")) }
> ff(dt)
[1] "0x5647bc0f6c50"
<pointer: 0x564776a93e88>
So not only is .internal.selfref inside the function identical to that of the caller's dt, so is the address. It really is the same object (I think).
This is not exactly the case for data.frames:
> df<- data.frame(a=1:2)
> address(df)
[1] "0x5647b39d21e8"
> ff<-function(x) { print(address(x)); x$b=3:4; print(address(x)) }
>
> ff(df)
[1] "0x5647b39d21e8"
[1] "0x5647ae24de78"
Maybe the root issue is that regular data.table operations somehow do not trigger R's copy-on-modify semantics?
I think what's surprising you is actually base R behavior, which is why it's not specifically documented in data.table (maybe it should be anyway, as the implications are more important for data.table).
You were surprised that the object passed to a function had the same address, but this is the same for base R as well:
x = 1:10
address(x)
# [1] "0x7fb7d4b6c820"
(\(y) print(address(y)))(x)
# [1] "0x7fb7d4b6c820"
What's being copied into the function environment is the pointer to x, not the data itself. Moreover, in base R, the parent x is effectively immutable from inside the function:
foo = function(y) {
  print(address(y))
  y[1L] = 2L
  print(address(y))
}
foo(x)
# [1] "0x7fb7d4b6c820"
# [1] "0x7fb7d4e11d28"
That is, as soon as we try to edit y, a copy is made, and the parent x is left untouched. This is related to reference counting -- you can see some work by Luke Tierney on this, e.g. this presentation.
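To make the base R side concrete, here is a small sketch using only base R (add_b is a made-up helper name, not something from the question). The function modifies its argument, but the caller's data.frame keeps its original columns because the modification lands on a copy:

```r
# Base R only: modifying an argument inside a function works on a copy.
df <- data.frame(a = 1:2)

# add_b is a hypothetical helper used only for illustration
add_b <- function(d) {
  d$b <- 3:4   # copy-on-modify: d no longer refers to the caller's object
  d
}

res <- add_b(df)

print(names(df))   # the caller's object still has only column "a"
print(names(res))  # the returned copy has columns "a" and "b"
```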
The difference for data.table is that its by-reference operations (:=, set* and friends) modify the object in place instead of triggering a copy, so the edit reaches the parent object. That's a double-edged sword, as I think you know.
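If you want a function that leaves the caller's table alone, one option is to take an explicit copy with data.table's copy() before using :=. A minimal sketch, assuming data.table is installed; add_b_inplace and add_b_copy are made-up names for illustration:

```r
library(data.table)

# Hypothetical helper: := modifies the caller's data.table by reference
add_b_inplace <- function(z) {
  z[, b := 3:4]
  invisible(z)
}

# Hypothetical helper: copy() first, so := only touches the local copy
add_b_copy <- function(z) {
  z <- copy(z)
  z[, b := 3:4]
  z
}

x <- data.table(a = 1:2)

y <- add_b_copy(x)
print(names(x))  # x is unchanged here: only "a"
print(names(y))  # the copy has "a" and "b"

add_b_inplace(x)
print(names(x))  # now x itself has "a" and "b"
```

The copy costs memory proportional to the size of the table, so whether it is worth it depends on how large the data is and on whether callers expect their input to be modified.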