Handling invalid selfref in R data.table when passing by reference to a function

Tags: r, data.table
My group writes a lot of code using data.table and we occasionally get bitten by the 'Invalid .internal.selfref detected and fixed by taking a copy of the whole table ...' warning. This behaviour can break our code when a data table is passed by reference to a function and I am trying to figure out how to work around it.

Suppose I have a function which adds a column to a data.table as a side effect -- note the original data.table is not returned.

foo <- function(mydt){
   mydt[, c := c("a", "b")]
   return(123)
}

> x<- data.table(a=c(1,2), b=c(3,4))
> foo(x) 
[1] 123
> x
   a b c
1: 1 3 a
2: 2 4 b

x has been updated with the new column. This is the desired behavior.

Now suppose something happens that breaks the internal self-ref in x:

> x<- data.table(a=c(1,2), b=c(3,4))
> x[["a"]] <- c(7,8)
> foo(x)
[1] 123
Warning message:
In `[.data.table`(mydt, , `:=`(c, c("a", "b"))) :
Invalid .internal.selfref detected and fixed by taking a copy ...

> x
   a b
1: 7 3
2: 8 4

I understand what happened (mostly). The [["a"]] construction is not data.table friendly; x was converted to a data frame and then back to a data table, which somehow messed up the internal workings. Then inside foo(), during the reference operation of adding a column, this problem was detected, and a copy of mydt was made; the new column 'c' was added to mydt. However, that copy operation severed the pass-by-reference relationship between x and mydt, so the additional columns are not part of x.

The function foo() is going to be used by different people, and it will be difficult to protect against invalid internal selfref situations. Someone out there might easily write something like x[["a"]] <- value, which would lead to invalid input. I'm trying to figure out how to handle this from inside foo().

So far I have this idea, at the beginning of foo():

if(!data.table:::selfrefok(mydt)) stop("mydt is corrupt.")

That at least gives us a chance of spotting the problem, but it's not very friendly to the users of foo(), because the ways in which these inputs can get corrupted can be pretty opaque. Ideally I would like to be able to correct for corrupted input and maintain the desired functionality of foo(). But I can't see how, unless I restructure my code so that foo returns mydt and assigns it to x in the calling scope, which is possible but not ideal. Any ideas?
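For reference, the return-and-reassign alternative mentioned above could look roughly like the sketch below. The name foo_returning is hypothetical, and the sketch assumes that copy() hands back a table with a valid selfref; data.table:::selfrefok is the same internal, unexported helper used in the check above, so it may change between versions.

```r
library(data.table)

# Hypothetical variant of foo(): repair a broken selfref on a local copy,
# add the column, and return the table instead of relying on a side effect.
foo_returning <- function(mydt) {
  if (!(data.table:::selfrefok(mydt) > 0L)) {
    mydt <- copy(mydt)   # assumption: copy() yields a table with a valid selfref
  }
  mydt[, c := c("a", "b")]
  mydt
}

x <- data.table(a = c(1, 2), b = c(3, 4))
x[["a"]] <- c(7, 8)      # the non-data.table-friendly assignment from the example
x <- foo_returning(x)    # the caller has to reassign the result
```

The cost is exactly the one described above: every caller must remember to write x <- foo_returning(x).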

pteehan asked Jul 16 '14 23:07


2 Answers

You should read the whole of the warning....

Then you would notice

At an earlier point, this data.table has been copied by R (or been created manually using structure() or similar). Avoid key<-, names<- and attr<- which in R currently (and oddly) may copy the whole data.table.

[[<- is similar to names<- and attr<- in that it will create a copy.
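As a minimal sketch of what avoiding these looks like in practice, the by-reference equivalents are :=, set() and setnames(), all exported by data.table. The particular columns and values below are only illustrative, and data.table:::selfrefok is an internal helper, so treat the check as an assumption about the current implementation.

```r
library(data.table)

foo <- function(mydt) {
  mydt[, c := c("a", "b")]
  return(123)
}

x <- data.table(a = c(1, 2), b = c(3, 4))

# By-reference replacements for x[["a"]] <- ... and names(x) <- ...
x[, a := c(7, 8)]                  # := updates the column in place
set(x, j = "b", value = c(5, 6))   # set() is the functional form of :=
setnames(x, "b", "b2")             # use instead of names(x) <- ...

# Internal helper; a positive value means the selfref is still valid
selfref_intact <- data.table:::selfrefok(x) > 0L

foo(x)                             # column c is added to x itself
```

Because none of these calls make R copy the table, the selfref stays valid and foo() keeps updating x by reference.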

You can preserve the by-reference behaviour by constructing the call with substitute and then evaluating it in the parent frame, so that := operates on the caller's own variable:

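foo <- function(x) {
   # Substitute the caller's expression for x (here, the symbol xx) into the call
   l <- substitute(x[, c := 'a'], as.list(match.call())['x'])
   # Evaluate xx[, c := 'a'] in the caller's frame rather than inside foo()
   eval.parent(l)
   return(123)
}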

xx<- data.table(a=c(1,2), b=c(3,4))
xx[["a"]] <- c(7,8)
foo(xx)
# [1] 123
# Warning message: .....

# but it now works!
xx
#    a b c
# 1: 7 3 a
# 2: 8 4 a

The warning remains but the function works as desired.

mnel answered Oct 06 '22 01:10


@pteehan, great question! In my mind, a much cleaner fix would be to restore the over-allocation during the assignment step itself, with a warning which basically says "don't do it!".

The way to do that would be through a [[<-.data.table method, which doesn't exist currently. Unless I'm missing something, it'd be a great addition, whose purpose is not to encourage using it, but to catch cases like this and direct people to the right usage (with a warning), while at the same time restoring the over-allocation.

Roughly:

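`[[<-.data.table` <- function(x, i, j, value) {
    warning("Don't do this. Use := instead.")
    # Re-run the same call, but through the data.frame method
    call = sys.call()
    call[[1L]] = `[[<-.data.frame`
    # copy() re-allocates the result, restoring the over-allocation and selfref
    ans = copy(eval(call, envir=parent.frame()))
}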

foo <- function(mydt) {
   mydt[, c := c("a", "b")]
   return(123)
}
x <- data.table(a = c(1,2), b = c(3,4))

x[["a"]] <- c(7,8)
# Warning message:
# In `[[<-.data.table`(`*tmp*`, "a", value = c(7, 8)) :
#   Don't do this. Use := instead.

data.table:::selfrefok(x)
# [1] 1

foo(x)
# [1] 123

x
#    a b c
# 1: 7 3 a
# 2: 8 4 b

Something along these lines should provide a cleaner solution I believe. Maybe this should get implemented.

PS: This post explains in detail why the warning in your question occurs.

Arun answered Oct 06 '22 02:10