R data.table weird value/reference semantics

Tags:

data.table

(This is a follow up question to this.)

Check this toy code:

> library(data.table)
> x <- data.frame(a = 1:2)
> foo <- function(z) { setDT(z); z[, b := 3:4]; z }
> y <- foo(x)
> 
> class(x)
[1] "data.table" "data.frame"
> x
   a
1: 1
2: 2

It looks like setDT did change x's class, but the new column b was not added to x.
What happened here?

asked Jul 07 '20 12:07

2 Answers

Inside your function, z refers to the same object as x until setDT is called.

library(data.table)
foo <- function(z) {print(address(z)); setDT(z); print(address(z))} 
x <- data.frame(a = 1:2)
address(x)
#[1] "0x555ec9a471e8"
foo(x)
#[1] "0x555ec9a471e8"
#[1] "0x555ec9ede300"

Inside setDT, execution reaches the following line while z is still pointing to the same address as x:

setattr(z, "class", data.table:::.resetclass(z, "data.frame"))

setattr does not make a copy, so x and z still point to the same address and both are now of class data.table:

x <- data.frame(a = 1:2)
z <- x
class(x)
#[1] "data.frame"
address(x)
#[1] "0x555ec95de600"
address(z)
#[1] "0x555ec95de600"

setattr(z, "class", data.table:::.resetclass(z, "data.frame"))

class(x)
#[1] "data.table" "data.frame"
address(x)
#[1] "0x555ec95de600"
address(z)
#[1] "0x555ec95de600"
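
The same by-reference behaviour can be seen with any attribute, not only class. Here is a minimal sketch using two made-up attribute names (tag_ref and tag_copy, chosen only for illustration) that contrasts setattr with base R's attr<-:

```r
library(data.table)

x <- data.frame(a = 1:2)
z <- x                               # x and z bind the same object

setattr(z, "tag_ref", "by ref")      # modifies the shared object in place
attr(x, "tag_ref")                   # the attribute is visible through x too

attr(z, "tag_copy") <- "copied"      # base R replacement: z is copied first
attr(x, "tag_copy")                  # NULL, x is untouched
```

This is why a class change made with setattr leaks to the caller, while one made with a base R replacement function does not.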

setalloccol is then called, which in this case executes:

assign("z", .Call(data.table:::Calloccolwrapper, z, 1024, FALSE))

after which x and z point to different addresses:

address(x)
#[1] "0x555ecaa09c00"
address(z)
#[1] "0x555ec95de600"

And both have the class data.table:

class(x)
#[1] "data.table" "data.frame"
class(z)
#[1] "data.table" "data.frame"

I think that if they had used

class(z) <- data.table:::.resetclass(z, "data.frame")

instead of

setattr(z, "class", data.table:::.resetclass(z, "data.frame"))

the problem would not occur.

x <- data.frame(a = 1:2)
z <- x
address(x)
#[1] "0x555ec9cd2228"
class(z) <- data.table:::.resetclass(z, "data.frame")
class(x)
#[1] "data.frame"
class(z)
#[1] "data.table" "data.frame"
address(x)
#[1] "0x555ec9cd2228"
address(z)
#[1] "0x555ec9cd65a8"

However, after class(z) <- value, z no longer points to the address it pointed to before (the column z$a itself is not copied):

z <- data.frame(a = 1:2)
address(z)
#[1] "0x5653dbe72b68"
address(z$a)
#[1] "0x5653db82e140"
class(z) <- c("data.table", "data.frame")
address(z)
#[1] "0x5653dbe82d98"
address(z$a)
#[1] "0x5653db82e140"

But the same is true after setDT; z no longer points to the address it pointed to before:

z <- data.frame(a = 1:2)
address(z)
#[1] "0x55b6f04d0db8"
setDT(z)
address(z)
#[1] "0x55b6efe1e0e0"

As @Matt-dowle pointed out, it is also possible to change the data in x through z:

x <- data.frame(a = c(1,3))
z <- x
setDT(z)
z[, b:=3:4]
z[2, a:=7]
z
#   a b
#1: 1 3
#2: 7 4
x
#   a
#1: 1
#2: 7

R.version.string
#[1] "R version 4.0.2 (2020-06-22)"
packageVersion("data.table")
#[1] ‘1.12.8’

answered Oct 21 '22 16:10

GKi

The position of the setalloccol call is indeed the direct culprit: it performs a shallow copy (i.e., it creates a new vector of pointers to the existing data columns) and in addition allocates 1024 extra slots (by default) for additional columns. If the class were set to data.table after this shallow copy (either by class(z)<- or by setattr), the change would be applied to this new vector and not to the original argument.
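
To make "shallow copy" concrete, here is a small base-R analogy. It does not call setalloccol; a plain list stands in for the table, data.table::address() is used only to compare pointers, and the names cols and shallow are made up for illustration.

```r
library(data.table)  # used only for address()

cols <- list(a = c(1, 3))   # stand-in for a one-column table
shallow <- cols             # both names bind the same list
shallow$b <- 3:4            # adding an element makes R copy the list of pointers

# The two lists are now distinct objects ...
lists_differ <- address(cols) != address(shallow)
# ... but element a is still the very same vector in both.
column_shared <- address(cols$a) == address(shallow$a)

lists_differ    # expected TRUE
column_shared   # expected TRUE
```

Anything applied to the outer list after such a copy (a class, an extra column) reaches only one of the two lists, while in-place changes to a shared column would be seen through both.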

However.

Even after using a fixed version of setDT (with setattr called after setalloccol), it seems there is no way to get consistent behaviour. Some operations apply to the caller copy, and some don't.

df <- data.frame(a=1:2, b=3:4)

foo1 <- function(z) { 
  setDT.fixed(z)
  z[, b:=5]   # will apply to the caller copy
  data.table::setDF(z)
}

foo1(df)
#    a b
# 1: 1 5
# 2: 2 5
class(df)
# [1] "data.frame"
df
#   a b
# 1 1 5
# 2 2 5

foo2 <- function(z) { 
  setDT.fixed(z)
  z[, c:=5]   # will NOT apply to the caller copy
  data.table::setDF(z)
}
foo2(df)
#    a b c
# 1: 1 3 5
# 2: 2 4 5
# Warning message:
# In `[.data.table`(z, , `:=`(c, 5)) :
#  Invalid .internal.selfref detected and fixed by taking a (shallow) copy of the data.table so that := can add this new column by reference. At an earlier point, this data.table has been copied by R (or was created manually using structure() or similar). Avoid names<- and attr<- which in R currently (and oddly) may copy the whole data.table. Use set* syntax instead to avoid copying: ?set, ?setnames and ?setattr. If this message doesn't help, please report your use case to the data.table issue tracker so the root cause can be fixed or this message improved.
class(df)
# [1] "data.table" "data.frame"
df
#    a b
# 1: 1 3
# 2: 2 4

(Using the i argument, e.g., z[!is.na(a), b:=6], gives an extra dimension of weirdness which I won't go into here.)

Bottom line, the data.table package took on the brave task of punching a hole in R's all-value semantics. It was pretty successful until setDT came along (BTW, in response to a SO question here). Using setDT within a function on an argument will probably never have consistent semantics and is almost guaranteed to get you nasty surprises.
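
If the caller's object must stay untouched, one defensive option is to take an explicit deep copy before converting. The sketch below assumes that paying for the copy is acceptable; foo_copy is a hypothetical helper name, not something from data.table.

```r
library(data.table)

# Hypothetical helper: work on a deep copy so the caller's object is left alone.
foo_copy <- function(z) {
  dt <- copy(z)        # deep copy of the argument
  setDT(dt)            # convert the local copy by reference
  dt[, b := 3:4]       # add a new column
  dt[2, a := 7L]       # modify an existing column
  dt[]
}

x <- data.frame(a = 1:2)
y <- foo_copy(x)

class(x)   # still "data.frame"
x$a        # still 1 2
names(y)   # "a" "b"
```

This gives up the no-copy advantage that setDT exists for, but both the class change and the := updates stay local to the function. as.data.table(z) is another copying way to get a data.table without converting the argument in place.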

answered Oct 21 '22 16:10

Ofek Shilon
