
Assignment via `:=` in a for loop (R data.table)

Tags: r, data.table

I'm trying to assign some new variables within a for loop: the variables share a common structure but are subsample-dependent.

I've tried for the life of me to reproduce this error on sample data and I can't. Here's code that works and gets at the gist of what I want to do:

library(data.table)

DT <- data.table(
  id = rep(1:100, each = 20L),
  period = rep(-9:10, 100L),
  grp = rep(sample(4L, size = 100L, replace = TRUE), each = 20L),
  y = runif(2000, min=0, max=5), key = c("id", "period")
)
DT[ , x := cumsum(y), by = id]
DT2 <- DT[id %in% seq(1, 100, by=2)]
DT3 <- DT[id %in% seq(1, 100, by=3)]

for (dd in list(DT, DT2, DT3)){
  setkey(setkey(dd, grp)[dd[period==0, sum(x), by = grp], x_at_0_by_grp := V1], id, period)
}

This works fine. However, when I run the same thing on my own data, it generates the Invalid .internal.selfref warning (and doesn't create the variable I want):

In [.data.table(setkey(dt, treatment), dt[posting_rel == 0, sum(current_balance), : Invalid .internal.selfref detected and fixed by taking a copy of the whole table so that := can add this new column by reference. At an earlier point, this data.table has been copied by R (or been created manually using structure() or similar). Avoid key<-, names<- and attr<- which in R currently (and oddly) may copy the whole data.table. Use set* syntax instead to avoid copying: ?set, ?setnames and ?setattr. Also, in R<=v3.0.2, list(DT1,DT2) copied the entire DT1 and DT2 (R's list() used to copy named objects); please upgrade to R>v3.0.2 if that is biting. If this message doesn't help, please report to datatable-help so the root cause can be fixed.

In fact, when I subset my data to only those columns needed within the merge, it also works fine on my data (though doesn't save to the original data sets).

This suggests to me it's a problem with keying, but I'm explicitly setting the keys every step of the way. I'm completely lost on how to debug this from here because I can't get the error to repeat except on my full data set.

If I break out the operation into steps, the error arises at the merge step:

for (dd in list(DT, DT2, DT3)){
  dummy <- dd[period==0, sum(x), by = grp]
  setkey(dd, grp)
  dd[dummy, x_at_0_by_grp := V1] #***ERROR HERE***
  setkey(dd, id, period)
}

Quick update: it also produces the error if I use lapply instead of a for loop.

Any ideas what on earth is going on here?


UPDATE: I've come up with a workaround by doing:

nnames <- c("dt", "dt2", "dt3")

dt_list <- list(DT, DT2, DT3)

for (ii in seq_along(dt_list)){
  dummy <- copy(dt_list[[ii]])
  dummy[ , x_at_0_by_grp := sum(x[period == 0]), by=grp]
  assign(nnames[ii], dummy)
}

Would still like to understand what's going on, and perhaps a better way of assigning variables iteratively in situations like this.
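As a minimal sketch of one way to tidy the workaround above: keep the subsamples in a named list and apply a single function to each element, which avoids assign() and hand-written names. The helper name add_x_at_0 and the seed are assumptions made for illustration; the data mirror the sample above. This does not explain the selfref warning. It only sidesteps it by working on explicit copies.

```r
library(data.table)

set.seed(1)  # arbitrary seed, only so the sketch is reproducible
DT <- data.table(
  id     = rep(1:100, each = 20L),
  period = rep(-9:10, 100L),
  grp    = rep(sample(4L, size = 100L, replace = TRUE), each = 20L),
  y      = runif(2000, min = 0, max = 5),
  key    = c("id", "period")
)
DT[, x := cumsum(y), by = id]

# Keep the subsamples in one named list instead of separate variables.
dt_list <- list(
  dt  = DT,
  dt2 = DT[id %in% seq(1, 100, by = 2)],
  dt3 = DT[id %in% seq(1, 100, by = 3)]
)

# Hypothetical helper: copy the table, then add the group-level column
# to the copy by reference, leaving the input untouched.
add_x_at_0 <- function(d) {
  d <- copy(d)
  d[, x_at_0_by_grp := sum(x[period == 0]), by = grp]
  d[]
}

dt_list <- lapply(dt_list, add_x_at_0)
```

Each table is then available as dt_list$dt, dt_list$dt2, and so on. The copy() costs memory on large tables, so this is a trade of speed for predictability.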

MichaelChirico asked Nov 01 '22 07:11


1 Answer

With 20-30 criteria, keeping them outside of a list (with manual names like dt2, etc.) is too clunky, so I'll just assume you have them all in dt_list.

I suggest making tables with just the stat you're computing, and then rbinding them:

xxt <- rbindlist(lapply(seq_along(dt_list), function(i)
         dt_list[[i]][, list(cond = i, xx = sum(x[period == 0])), by = grp]))

which creates

    grp cond       xx
 1:   1    1 623.3448
 2:   2    1 784.8438
 3:   4    1 699.2362
 4:   3    1 367.7196
 5:   1    2 323.6268
 6:   4    2 307.0374
 7:   2    2 447.0753
 8:   3    2 185.7377
 9:   1    3 275.4897
10:   4    3 243.0214
11:   2    3 149.6041
12:   3    3 166.3626

You can easily merge back if you really want those vars. For example, for dt2:

myi <- 2
setkey(dt_list[[myi]], grp)[xxt[cond == myi, list(grp, xx)]]

This doesn't resolve the bug you're running into, but I think it is a better approach.
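To merge the stat back into every table rather than just one, a sketch along these lines may help. It assumes a data.table version that supports the on= argument (1.9.6 or later) and uses a small made-up data set; the loop variable k and the column name x_at_0_by_grp are chosen for illustration.

```r
library(data.table)

set.seed(1)  # arbitrary seed, only so the sketch is reproducible
# Small illustrative data: 12 ids, 3 periods each, 4 groups.
DT <- data.table(
  id     = rep(1:12, each = 3L),
  period = rep(-1:1, 12L),
  grp    = rep(rep(1:4, 3L), each = 3L),
  x      = runif(36)
)
dt_list <- list(DT,
                DT[id %in% seq(1, 12, by = 2)],
                DT[id %in% seq(1, 12, by = 3)])

# One row per (table, group) holding the stat, as in the answer above.
xxt <- rbindlist(lapply(seq_along(dt_list), function(k)
  dt_list[[k]][, list(cond = k, xx = sum(x[period == 0])), by = grp]))

# Update join: look up each row's grp in the matching slice of xxt and
# add the value as a new column. `:=` adds the column by reference;
# assigning back to the list is a precaution in case data.table had to
# take a copy (as the selfref warning describes).
for (k in seq_along(dt_list)) {
  d <- dt_list[[k]]
  d[xxt[cond == k], on = "grp", x_at_0_by_grp := i.xx]
  dt_list[[k]] <- d
}
```

Because the join uses on= rather than keys, no setkey() calls are needed and the row order of each table is left alone.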

Frank answered Nov 08 '22 07:11