
Is R data.table documented to pass by reference as argument?

Tags: r, data.table

Check this toy code:

> x <- data.table(a = 1:2) 
> foo <- function(z) { z[, b:=3:4]  }
> y <- foo(x)
> x[]
   a b
1: 1 3
2: 2 4

It seems data.table is passed by reference. Is this intentional? Is this documented? I did read through the docs and couldn't find a mention of this behaviour.

I'm not asking about data.table's documented reference semantics (in :=, set* and some others). I'm asking whether a complete data.table object is supposed to be passed by reference as a function argument.


Edit: Following @Oliver's answer, here are some more curious examples.

> dt<- data.table(a=1:2)
> attr(dt, ".internal.selfref")
<pointer: 0x564776a93e88>
> address(dt)
[1] "0x5647bc0f6c50"
> 
> ff<-function(x) { x[, b:=3:4]; print(address(x)); print(attr(x, ".internal.selfref")) }
> ff(dt)
[1] "0x5647bc0f6c50"
<pointer: 0x564776a93e88>

So not only is .internal.selfref identical to that of the caller's dt, so is the address. It really is the same object (I think).

This is not exactly the case for data.frames:

> df<- data.frame(a=1:2)
> address(df)
[1] "0x5647b39d21e8"
> ff<-function(x) { print(address(x)); x$b=3:4; print(address(x)) }
> 
> ff(df)
[1] "0x5647b39d21e8"
[1] "0x5647ae24de78"

Maybe the root issue is that regular data.table operations somehow do not trigger R's copy-on-modify semantics?

asked Oct 22 '25 07:10 by Ofek Shilon


1 Answer

I think what surprised you is actually base R behavior, which is why it's not specifically documented in data.table (maybe it should be anyway, since the implications matter more for data.table).

You were surprised that the object passed to a function had the same address, but the same is true in base R:

library(data.table)  # provides address()
x = 1:10
address(x)
# [1] "0x7fb7d4b6c820"
(\(y) print(address(y)))(x)
# [1] "0x7fb7d4b6c820"

What's being copied into the function environment is the pointer to x, not the data. Moreover, in base R the caller's x is effectively immutable from inside the function:

foo = function(y) {
  print(address(y))
  y[1L] = 2L
  print(address(y))
}
foo(x)
# [1] "0x7fb7d4b6c820"
# [1] "0x7fb7d4e11d28"

That is, as soon as we try to edit y, a copy is made. This is related to reference counting -- you can see some work by Luke Tierney on this, e.g. this presentation
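To see the consequence of that copy for the caller, here is a minimal base R sketch. The helper name set_first is hypothetical and only for illustration; it is not part of any package.

```r
x <- 1:10

# Hypothetical helper: modifies its argument and returns the result.
set_first <- function(y) {
  y[1L] <- 2L   # the assignment works on a copy, not on the caller's vector
  y
}

z <- set_first(x)
x[1L]  # 1: the caller's vector is unchanged
z[1L]  # 2: only the returned copy carries the edit
```

The edit only reaches the caller if the result is assigned back, e.g. x <- set_first(x).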

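The difference for data.table is that its by-reference operations (:=, set*) modify the object in place instead of triggering that copy, so the edit is visible in the caller. In effect, data.table grants edit permissions on the parent object, which is a double-edged sword, as I think you know.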
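If a function should not touch the caller's table, one option is to take an explicit deep copy with data.table's copy() before using :=. The sketch below contrasts the two behaviours; the function names add_b_in_place and add_b_on_copy are made up for this example.

```r
library(data.table)

dt <- data.table(a = 1:2)

# Hypothetical helper: := updates the caller's table by reference.
add_b_in_place <- function(z) {
  z[, b := 3:4]
  invisible(z)
}

# Hypothetical helper: copy() makes a deep copy first,
# so := only changes the local table.
add_b_on_copy <- function(z) {
  z <- copy(z)
  z[, b := 3:4]
  z[]
}

res <- add_b_on_copy(dt)
names(dt)   # "a": the caller's table is untouched
names(res)  # "a" "b"

add_b_in_place(dt)
names(dt)   # "a" "b": the caller's table was modified
```

The copy costs memory proportional to the table, so whether to pay for it is a per-function judgment call.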

answered Oct 23 '25 22:10 by MichaelChirico


