Reference problem in data.table following a copy

Tags: r, data.table

I have a tricky problem with assignment by reference in a data.table column nested inside another data.table. I was able to reproduce the behaviour in the example below.

I'm sorry that it is still long and takes some time to understand fully, but it is the shortest example I could produce that shows the problem.

Let's say I create the following data.table named data_1, containing a single list column whose elements are data.tables:

library(data.table)

set.seed(20200602L)

data_1 <- data.table(
  foo = replicate(5L, {
    data.table(
      bar = lapply(sample(3L, 5L, replace=TRUE), rpois, 1)
    )
  }, simplify=FALSE)
)

data_1[]
##              foo
##  1: <data.table>
##  2: <data.table>
##  3: <data.table>
##  4: <data.table>
##  5: <data.table>

The content of the column foo looks like this:

data_1[, foo]
##  [[1]]
##       bar
##  1: 4,0,1
##  2:   0,2
##  3: 1,3,2
##  4:   1,1
##  5:     0
##  
##  [[2]]
##     bar
##  1:   2
##  2: 0,3
##  3:   0
##  4: 2,3
##  5: 0,0
##  
##  [[3]]
##       bar
##  1: 0,1,1
##  2: 1,2,1
##  3:   2,1
##  4:     1
##  5:     1
##  
##  [[4]]
##       bar
##  1:     1
##  2:   3,3
##  3:     0
##  4:   2,2
##  5: 0,0,0
##  
##  [[5]]
##     bar
##  1: 0,0
##  2: 0,0
##  3: 0,1
##  4: 2,1
##  5:   0

I would then like to create a function fun() that adds a column baz to each element of the column foo. Each entry of baz is the corresponding entry of bar in reverse order, as shown below:

fun <- function(data) {

  data[, .(lapply(foo, function(x) {
    x[, baz:=lapply(bar, function(y) {
      rev(y)
    })]
  }))]

}

Before applying that function to data_1, I copy it into data_2 because I need to keep the original intact.

data_2 <- copy(data_1)

invisible(fun(data_1))

data_1[, foo]
##  [[1]]
##       bar   baz
##  1: 4,0,1 1,0,4
##  2:   0,2   2,0
##  3: 1,3,2 2,3,1
##  4:   1,1   1,1
##  5:     0     0
##  
##  [[2]]
##     bar baz
##  1:   2   2
##  2: 0,3 3,0
##  3:   0   0
##  4: 2,3 3,2
##  5: 0,0 0,0
##  
##  [[3]]
##       bar   baz
##  1: 0,1,1 1,1,0
##  2: 1,2,1 1,2,1
##  3:   2,1   1,2
##  4:     1     1
##  5:     1     1
##  
##  [[4]]
##       bar   baz
##  1:     1     1
##  2:   3,3   3,3
##  3:     0     0
##  4:   2,2   2,2
##  5: 0,0,0 0,0,0
##  
##  [[5]]
##     bar baz
##  1: 0,0 0,0
##  2: 0,0 0,0
##  3: 0,1 1,0
##  4: 2,1 1,2
##  5:   0   0

One can double-check that data_2 is still intact:

data_2[, foo]
##  [[1]]
##       bar
##  1: 4,0,1
##  2:   0,2
##  3: 1,3,2
##  4:   1,1
##  5:     0
##  
##  [[2]]
##     bar
##  1:   2
##  2: 0,3
##  3:   0
##  4: 2,3
##  5: 0,0
##  
##  [[3]]
##       bar
##  1: 0,1,1
##  2: 1,2,1
##  3:   2,1
##  4:     1
##  5:     1
##  
##  [[4]]
##       bar
##  1:     1
##  2:   3,3
##  3:     0
##  4:   2,2
##  5: 0,0,0
##  
##  [[5]]
##     bar
##  1: 0,0
##  2: 0,0
##  3: 0,1
##  4: 2,1
##  5:   0

Up to that point, everything looks fine. However, let's say I change my mind and want to apply fun() to data_2 as well. I would have expected it to work the same way as it did for data_1. Unfortunately, it does not: it raises warnings, and data_2 is left without the baz column:

invisible(fun(data_2))
##  Warning messages:
##  1: In `[.data.table`(x, , `:=`(baz, lapply(bar, function(y) { :
##    Invalid .internal.selfref detected and fixed by taking a (shallow) copy of the data.table so that := can add this new column by reference. At an earlier point, this data.table has been copied by R (or was created manually using structure() or similar). Avoid names<- and attr<- which in R currently (and oddly) may copy the whole data.table. Use set* syntax instead to avoid copying: ?set, ?setnames and ?setattr. If this message doesn't help, please report your use case to the data.table issue tracker so the root cause can be fixed or this message improved.
##  2: In `[.data.table`(x, , `:=`(baz, lapply(bar, function(y) { :
##    Invalid .internal.selfref detected and fixed by taking a (shallow) copy of the data.table so that := can add this new column by reference. At an earlier point, this data.table has been copied by R (or was created manually using structure() or similar). Avoid names<- and attr<- which in R currently (and oddly) may copy the whole data.table. Use set* syntax instead to avoid copying: ?set, ?setnames and ?setattr. If this message doesn't help, please report your use case to the data.table issue tracker so the root cause can be fixed or this message improved.
##  3: In `[.data.table`(x, , `:=`(baz, lapply(bar, function(y) { :
##    Invalid .internal.selfref detected and fixed by taking a (shallow) copy of the data.table so that := can add this new column by reference. At an earlier point, this data.table has been copied by R (or was created manually using structure() or similar). Avoid names<- and attr<- which in R currently (and oddly) may copy the whole data.table. Use set* syntax instead to avoid copying: ?set, ?setnames and ?setattr. If this message doesn't help, please report your use case to the data.table issue tracker so the root cause can be fixed or this message improved.
##  4: In `[.data.table`(x, , `:=`(baz, lapply(bar, function(y) { :
##    Invalid .internal.selfref detected and fixed by taking a (shallow) copy of the data.table so that := can add this new column by reference. At an earlier point, this data.table has been copied by R (or was created manually using structure() or similar). Avoid names<- and attr<- which in R currently (and oddly) may copy the whole data.table. Use set* syntax instead to avoid copying: ?set, ?setnames and ?setattr. If this message doesn't help, please report your use case to the data.table issue tracker so the root cause can be fixed or this message improved.
##  5: In `[.data.table`(x, , `:=`(baz, lapply(bar, function(y) { :
##    Invalid .internal.selfref detected and fixed by taking a (shallow) copy of the data.table so that := can add this new column by reference. At an earlier point, this data.table has been copied by R (or was created manually using structure() or similar). Avoid names<- and attr<- which in R currently (and oddly) may copy the whole data.table. Use set* syntax instead to avoid copying: ?set, ?setnames and ?setattr. If this message doesn't help, please report your use case to the data.table issue tracker so the root cause can be fixed or this message improved.

data_2[, foo]
##  [[1]]
##       bar
##  1: 4,0,1
##  2:   0,2
##  3: 1,3,2
##  4:   1,1
##  5:     0
##  
##  [[2]]
##     bar
##  1:   2
##  2: 0,3
##  3:   0
##  4: 2,3
##  5: 0,0
##  
##  [[3]]
##       bar
##  1: 0,1,1
##  2: 1,2,1
##  3:   2,1
##  4:     1
##  5:     1
##  
##  [[4]]
##       bar
##  1:     1
##  2:   3,3
##  3:     0
##  4:   2,2
##  5: 0,0,0
##  
##  [[5]]
##     bar
##  1: 0,0
##  2: 0,0
##  3: 0,1
##  4: 2,1
##  5:   0

Can someone explain why this happens, and maybe point me to a way to solve the problem?


References

sessionInfo()
##  R version 4.0.0 (2020-04-24)
##  Platform: x86_64-pc-linux-gnu (64-bit)
##  Running under: SUSE Linux Enterprise Server 12 SP5
##  
##  Matrix products: default
##  BLAS:   /apps/R-4.0.0/lib/libRblas.so
##  LAPACK: /apps/R-4.0.0/lib/libRlapack.so
##  
##  locale:
##   [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##   [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
##  
##  attached base packages:
##  [1] stats     graphics  grDevices utils     datasets  methods   base     
##  
##  other attached packages:
##  [1] data.table_1.12.8
##  
##  loaded via a namespace (and not attached):
##  [1] compiler_4.0.0 tools_4.0.0 
asked Jun 03 '20 03:06 by J.P. Le Cavalier

1 Answer

The .internal.selfref attribute is not being updated by copy() for the constituent data.tables:

all.equal(
  lapply(data_1$foo, attr, '.internal.selfref'), 
  lapply(data_2$foo, attr, '.internal.selfref')
)
# [1] TRUE

This needs to be updated; you can fix the issue by running alloc.col on each of the copied data.tables:

data_2 = copy(data_1)
# also possible to do lapply(foo, copy), but this should be slower
data_2[ , foo := lapply(foo, alloc.col)]

invisible(fun(data_1))

invisible(fun(data_2))
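
If you would rather not depend on := reaching the nested tables by reference at all, another option is to assign the updated tables back to foo. As far as I can tell from the warning, := falls back to a shallow copy of each nested table, so the new column ends up on that temporary copy rather than on the table stored in data_2. The sketch below is only an illustration: fun_assign is a hypothetical name, not something from data.table, and it takes an extra copy() of each nested table, so it is probably slower than the alloc.col approach.

```r
library(data.table)

# Same construction as in the question
set.seed(20200602L)
data_1 <- data.table(
  foo = replicate(5L, {
    data.table(
      bar = lapply(sample(3L, 5L, replace=TRUE), rpois, 1)
    )
  }, simplify=FALSE)
)

# Hypothetical variant of fun(): build each updated nested table on a fresh
# copy and assign the results back to foo, instead of relying on := reaching
# the nested tables by reference.
fun_assign <- function(data) {
  data[, foo := lapply(foo, function(x) {
    copy(x)[, baz := lapply(bar, rev)]
  })]
  invisible(data)
}

data_2 <- copy(data_1)
fun_assign(data_2)

names(data_2$foo[[1L]])  # bar and baz
names(data_1$foo[[1L]])  # still only bar
```

Because the new list of tables is written back into the foo column of data_2, the result no longer depends on the state of .internal.selfref in the nested tables.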
answered Nov 19 '22 00:11 by MichaelChirico