I have a complicated problem regarding assignment by reference in a data.table column nested in another data.table. I was able to reproduce the behaviour in the example below.
I'm sorry that it is still long and takes some time to understand fully, but it is the shortest example I could produce that shows my problem.
Let's say I create the following data.table named data_1, containing a single column whose elements are themselves data.tables:
library(data.table)
set.seed(20200602L)
data_1 <- data.table(
  foo = replicate(5L, {
    data.table(
      bar = lapply(sample(3L, 5L, replace=TRUE), rpois, 1)
    )
  }, simplify=FALSE)
)
data_1[]
## foo
## 1: <data.table>
## 2: <data.table>
## 3: <data.table>
## 4: <data.table>
## 5: <data.table>
One can explore the content of the column foo below:
data_1[, foo]
## [[1]]
## bar
## 1: 4,0,1
## 2: 0,2
## 3: 1,3,2
## 4: 1,1
## 5: 0
##
## [[2]]
## bar
## 1: 2
## 2: 0,3
## 3: 0
## 4: 2,3
## 5: 0,0
##
## [[3]]
## bar
## 1: 0,1,1
## 2: 1,2,1
## 3: 2,1
## 4: 1
## 5: 1
##
## [[4]]
## bar
## 1: 1
## 2: 3,3
## 3: 0
## 4: 2,2
## 5: 0,0,0
##
## [[5]]
## bar
## 1: 0,0
## 2: 0,0
## 3: 0,1
## 4: 2,1
## 5: 0
I would then like to create a function fun() that adds a column baz to each element of the column foo. This column baz would mirror the list in bar (each element reversed), as shown below:
fun <- function(data) {
  data[, .(lapply(foo, function(x) {
    x[, baz:=lapply(bar, function(y) {
      rev(y)
    })]
  }))]
}
Before applying that function to data_1, I'll copy it into data_2 because I need to keep the original intact.
data_2 <- copy(data_1)
invisible(fun(data_1))
data_1[, foo]
## [[1]]
## bar baz
## 1: 4,0,1 1,0,4
## 2: 0,2 2,0
## 3: 1,3,2 2,3,1
## 4: 1,1 1,1
## 5: 0 0
##
## [[2]]
## bar baz
## 1: 2 2
## 2: 0,3 3,0
## 3: 0 0
## 4: 2,3 3,2
## 5: 0,0 0,0
##
## [[3]]
## bar baz
## 1: 0,1,1 1,1,0
## 2: 1,2,1 1,2,1
## 3: 2,1 1,2
## 4: 1 1
## 5: 1 1
##
## [[4]]
## bar baz
## 1: 1 1
## 2: 3,3 3,3
## 3: 0 0
## 4: 2,2 2,2
## 5: 0,0,0 0,0,0
##
## [[5]]
## bar baz
## 1: 0,0 0,0
## 2: 0,0 0,0
## 3: 0,1 1,0
## 4: 2,1 1,2
## 5: 0 0
One can double-check that data_2 is still intact:
data_2[, foo]
## [[1]]
## bar
## 1: 4,0,1
## 2: 0,2
## 3: 1,3,2
## 4: 1,1
## 5: 0
##
## [[2]]
## bar
## 1: 2
## 2: 0,3
## 3: 0
## 4: 2,3
## 5: 0,0
##
## [[3]]
## bar
## 1: 0,1,1
## 2: 1,2,1
## 3: 2,1
## 4: 1
## 5: 1
##
## [[4]]
## bar
## 1: 1
## 2: 3,3
## 3: 0
## 4: 2,2
## 5: 0,0,0
##
## [[5]]
## bar
## 1: 0,0
## 2: 0,0
## 3: 0,1
## 4: 2,1
## 5: 0
Up to that point, everything looks fine. However, let's say I change my mind and want to apply the function fun() to data_2 as well. I would have expected it to work the same as it did for data_1. Unfortunately, it does not:
invisible(fun(data_2))
## Warning messages:
## 1: In `[.data.table`(x, , `:=`(baz, lapply(bar, function(y) { :
## Invalid .internal.selfref detected and fixed by taking a (shallow) copy of the data.table so that := can add this new column by reference. At an earlier point, this data.table has been copied by R (or was created manually using structure() or similar). Avoid names<- and attr<- which in R currently (and oddly) may copy the whole data.table. Use set* syntax instead to avoid copying: ?set, ?setnames and ?setattr. If this message doesn't help, please report your use case to the data.table issue tracker so the root cause can be fixed or this message improved.
## 2: In `[.data.table`(x, , `:=`(baz, lapply(bar, function(y) { :
## Invalid .internal.selfref detected and fixed by taking a (shallow) copy of the data.table so that := can add this new column by reference. At an earlier point, this data.table has been copied by R (or was created manually using structure() or similar). Avoid names<- and attr<- which in R currently (and oddly) may copy the whole data.table. Use set* syntax instead to avoid copying: ?set, ?setnames and ?setattr. If this message doesn't help, please report your use case to the data.table issue tracker so the root cause can be fixed or this message improved.
## 3: In `[.data.table`(x, , `:=`(baz, lapply(bar, function(y) { :
## Invalid .internal.selfref detected and fixed by taking a (shallow) copy of the data.table so that := can add this new column by reference. At an earlier point, this data.table has been copied by R (or was created manually using structure() or similar). Avoid names<- and attr<- which in R currently (and oddly) may copy the whole data.table. Use set* syntax instead to avoid copying: ?set, ?setnames and ?setattr. If this message doesn't help, please report your use case to the data.table issue tracker so the root cause can be fixed or this message improved.
## 4: In `[.data.table`(x, , `:=`(baz, lapply(bar, function(y) { :
## Invalid .internal.selfref detected and fixed by taking a (shallow) copy of the data.table so that := can add this new column by reference. At an earlier point, this data.table has been copied by R (or was created manually using structure() or similar). Avoid names<- and attr<- which in R currently (and oddly) may copy the whole data.table. Use set* syntax instead to avoid copying: ?set, ?setnames and ?setattr. If this message doesn't help, please report your use case to the data.table issue tracker so the root cause can be fixed or this message improved.
## 5: In `[.data.table`(x, , `:=`(baz, lapply(bar, function(y) { :
## Invalid .internal.selfref detected and fixed by taking a (shallow) copy of the data.table so that := can add this new column by reference. At an earlier point, this data.table has been copied by R (or was created manually using structure() or similar). Avoid names<- and attr<- which in R currently (and oddly) may copy the whole data.table. Use set* syntax instead to avoid copying: ?set, ?setnames and ?setattr. If this message doesn't help, please report your use case to the data.table issue tracker so the root cause can be fixed or this message improved.
data_2[, foo]
## [[1]]
## bar
## 1: 4,0,1
## 2: 0,2
## 3: 1,3,2
## 4: 1,1
## 5: 0
##
## [[2]]
## bar
## 1: 2
## 2: 0,3
## 3: 0
## 4: 2,3
## 5: 0,0
##
## [[3]]
## bar
## 1: 0,1,1
## 2: 1,2,1
## 3: 2,1
## 4: 1
## 5: 1
##
## [[4]]
## bar
## 1: 1
## 2: 3,3
## 3: 0
## 4: 2,2
## 5: 0,0,0
##
## [[5]]
## bar
## 1: 0,0
## 2: 0,0
## 3: 0,1
## 4: 2,1
## 5: 0
Can someone explain why this happens and maybe point me to a way to solve the problem?
References
sessionInfo()
## R version 4.0.0 (2020-04-24)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: SUSE Linux Enterprise Server 12 SP5
##
## Matrix products: default
## BLAS: /apps/R-4.0.0/lib/libRblas.so
## LAPACK: /apps/R-4.0.0/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] data.table_1.12.8
##
## loaded via a namespace (and not attached):
## [1] compiler_4.0.0 tools_4.0.0
The .internal.selfref is not being updated by copy for the constituent data.tables:
all.equal(
lapply(data_1$foo, attr, '.internal.selfref'),
lapply(data_2$foo, attr, '.internal.selfref')
)
# [1] TRUE
This needs to be updated; you can fix the issue by running alloc.col on the copied data.tables:
data_2 = copy(data_1)
# also possible to do lapply(foo, copy), but this should be slower
data_2[ , foo := lapply(foo, alloc.col)]
invisible(fun(data_1))
invisible(fun(data_2))
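If you would rather not depend on the state of .internal.selfref in the nested tables at all, a different approach (not the fix above, just a sketch) is to have the function copy each nested table, add baz to the copy, and store the copies back into foo. The name fun_copy is hypothetical and only used for illustration; the setup code is repeated from the question so the example runs on its own.

```r
library(data.table)

# Hypothetical variant of fun(): copies each nested table before adding baz,
# then writes the modified copies back into the foo column of `data`.
fun_copy <- function(data) {
  data[, foo := lapply(foo, function(x) {
    copy(x)[, baz := lapply(bar, rev)]
  })]
  invisible(data)
}

# Same setup as in the question.
set.seed(20200602L)
data_1 <- data.table(
  foo = replicate(5L, {
    data.table(bar = lapply(sample(3L, 5L, replace = TRUE), rpois, 1))
  }, simplify = FALSE)
)
data_2 <- copy(data_1)

fun_copy(data_2)
names(data_2$foo[[1L]])  # now includes "baz"
names(data_1$foo[[1L]])  # data_1 is untouched: only "bar"
```

This costs an extra copy of every nested table on each call, so it is likely slower than the alloc.col approach on large data; the trade-off is that the result does not depend on how the input was copied beforehand.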