
Changing behavior for closure stored in data.table between R 3.4.3 and R 3.6.0

Tags:

r

data.table

I noticed the following peculiar behavior after upgrading from R 3.4.3 to R 3.6.0 (both with data.table 1.12.6). In 3.4.3, the all.equal call in the code below returns TRUE. In 3.6.0, it reports a mean relative difference instead: even though we look up the approxfun closure computed for group "a", the values from group "b" are used (probably due in some way to lazy evaluation). In 3.6.0, the issue can be solved by adding copy() in the calls to approxfun, based on this question: Handling of closures in data.table

The fascinating thing to me is that the problem does not show up at all in 3.4.3. Any idea what changed?

library(data.table)
data <- data.table(
  group = c(rep("a", 4), rep("b", 4)),
  x = rep(c(.02, .04, .12, .21), 2),
  y = c(
    0.0122, 0.01231, 0.01325, 0.01374, 0.01218, 0.01229, 0.0133, 0.01379)
)

dtFuncs <- data[ , list(
  func = list(stats::approxfun(x, y, rule = 2))
), by = group]

f <- function(group, x) {
  dtResults <- CJ(group = group, x = x)
  dtResults <- dtResults[ , {
    .g <- group
    f2 <- dtFuncs[group == .g, func][[1]]
    list(x = x, y = f2(x))
  }, by = group] 
  dtResults
}

x0 <- .07
g <- "a"
all.equal(
  with(data[group == g], approx(x, y, x0, rule = 2)$y),
  f(group = g, x = x0)$y
)
rlh2 asked Nov 23 '19 23:11


1 Answer

After running git bisect on the r-source, I was able to deduce that it was this commit that caused the behavior: https://github.com/wch/r-source/commit/adcf18b773149fa20f289f2c8f2e45e6f7b0dbfe

What fundamentally happened is that approxfun no longer makes an internal copy of its inputs when the x values are already sorted. If the data had been randomly ordered, the code would have continued to work (see the snippet below).
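To make the mechanism more concrete, here is a minimal base R sketch (the toy vectors are my own illustration, not data from the question). The function returned by approxfun is a closure that keeps x and y in its enclosing environment, so it reads whatever those stored vectors contain at call time:

```r
## toy data, chosen only for illustration
x <- c(1, 2, 3)
y <- c(10, 20, 30)
f <- stats::approxfun(x, y, rule = 2)

## the closure carries its own x and y in its environment
env <- environment(f)
stored_x <- get("x", env)
stored_y <- get("y", env)
print(stored_y)

## linear interpolation halfway between the first two points
f(1.5)
```

As I understand it, if the vectors handed to approxfun are buffers that data.table reuses for the next "by" group, and no copy is made along the way, every stored closure ends up reading the values of the last group processed.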

The lesson for me is that it's probably best not to store complicated objects such as closures in a data.table, since the same environment is reused for each "by" group (or else to be very deliberate about using data.table::copy).
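For the "deliberate copy" route, a minimal sketch could look like the following. It assumes that wrapping both arguments in data.table::copy() is what the question means by adding a copy statement to the approxfun call; the names fa, x0, expected and actual are only for illustration:

```r
library(data.table)

data <- data.table(
  group = c(rep("a", 4), rep("b", 4)),
  x = rep(c(.02, .04, .12, .21), 2),
  y = c(
    0.0122, 0.01231, 0.01325, 0.01374, 0.01218, 0.01229, 0.0133, 0.01379)
)

## copy() hands approxfun fresh vectors, so each closure
## keeps the values of its own group
dtFuncs <- data[ , list(
  func = list(stats::approxfun(copy(x), copy(y), rule = 2))
), by = group]

x0 <- .07
fa <- dtFuncs[group == "a", func][[1]]
expected <- with(data[group == "a"], approx(x, y, x0, rule = 2)$y)
actual <- fa(x0)
all.equal(expected, actual)
```

The cost is one extra copy of x and y per group, which seems a fair price for closures that do not depend on data.table's internal buffers.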

## should be run under R >= 3.6.0 to see the disparity
library(data.table)

## original sorted x (does not work)
data <- data.table(
  group = c(rep("a", 4), rep("b", 4)),
  x = rep(c(.02, .04, .12, .21), 2),
  y = c(
    0.0122, 0.01231, 0.01325, 0.01374, 0.01218, 0.01229, 0.0133, 0.01379)
)

dtFuncs <- data[ , {
    print(environment())
    list(
        func = list(stats::approxfun(x, y, rule = 2))
    )
}, by = group]

f <- function(group, x) {
  dtResults <- CJ(group = group, x = x)
  dtResults <- dtResults[ , {
    .g <- group
    f2 <- dtFuncs[group == .g, func][[1]]
    list(x = x, y = f2(x))
  }, by = group] 
  dtResults
}

get("y", environment(dtFuncs$func[[1]]))
get("y", environment(dtFuncs$func[[2]]))

x0 <- .07
g <- "a"
all.equal(
  with(data[group == g], approx(x, y, x0, rule = 2)$y),
  f(group = g, x = x0)$y
)

## unsorted x (works)
data <- data.table(
  group = c(rep("a", 4), rep("b", 4)),
  x = rep(c(.02, .04, .12, .21), 2),
  y = c(
    0.0122, 0.01231, 0.01325, 0.01374, 0.01218, 0.01229, 0.0133, 0.01379)
)
set.seed(10)
data <- data[sample(1:.N, .N)]
dtFuncs <- data[ , {
    print(environment())
    list(
        func = list(stats::approxfun(x, y, rule = 2))
    )
}, by = group]

f <- function(group, x) {
  dtResults <- CJ(group = group, x = x)
  dtResults <- dtResults[ , {
    .g <- group
    f2 <- dtFuncs[group == .g, func][[1]]
    list(x = x, y = f2(x))
  }, by = group] 
  dtResults
}

get("y", environment(dtFuncs$func[[1]]))
get("y", environment(dtFuncs$func[[2]]))

x0 <- .07
g <- "a"
all.equal(
  with(data[group == g], approx(x, y, x0, rule = 2)$y),
  f(group = g, x = x0)$y
)

## better approach: it may be safer to avoid mixing objects that are treated
## by reference (data.table & closures) altogether
fList <- lapply(split(data, by = "group"), function(d) {
    ## d is the sub-table for one group; x and y are its columns
    with(d, stats::approxfun(x, y, rule = 2))
})
fList
fList[[1]](.07) != fList[[2]](.07)
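For completeness, here is a sketch of how the lookup function f from the question might be rewritten on top of such a named list instead of a list column. The mapply-based lookup is my own choice for illustration, not something from the original code:

```r
library(data.table)

data <- data.table(
  group = c(rep("a", 4), rep("b", 4)),
  x = rep(c(.02, .04, .12, .21), 2),
  y = c(
    0.0122, 0.01231, 0.01325, 0.01374, 0.01218, 0.01229, 0.0133, 0.01379)
)

## one closure per group, built from a separate sub-table each time
fList <- lapply(split(data, by = "group"), function(d) {
  stats::approxfun(d$x, d$y, rule = 2)
})

f <- function(group, x) {
  dtResults <- CJ(group = group, x = x)
  ## look up each row's closure by group name and evaluate it at x
  dtResults[ , y := mapply(
    function(g, xi) fList[[g]](xi), group, x, USE.NAMES = FALSE)]
  dtResults[]
}

x0 <- .07
g <- "a"
all.equal(
  with(data[group == g], approx(x, y, x0, rule = 2)$y),
  f(group = g, x = x0)$y
)
```

Since the closures live in an ordinary list, nothing here depends on how data.table manages memory across "by" groups.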
rlh2 answered Oct 13 '22 06:10