Getting a random internal selfref error in data.table for R

I love data.table: it's fast and intuitive. What could be better? Alas, here's my problem: when referring to a data.table within a foreach() loop (using the doMC backend), I occasionally get the following error (example in the appendix):

Error in { : 
  Internal error: .internal.selfref prot is not itself an extptr

One annoying aspect is that I can't reproduce it consistently, but it does happen during some long (several-hour) tasks, so I want to make sure it never happens, if possible.

Since I refer to the same data.table, DT, in each loop, I tried running the following at the beginning of each loop:

setattr(DT,".internal.selfref",NULL)   

...to remove the invalid/corrupted self ref attribute. This works and the internal selfref error no longer occurs. It's a workaround, though.
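For readers unfamiliar with this attribute, here is a minimal sketch of what the workaround touches. The toy table `DT.toy` is hypothetical and only for illustration; the sketch shows that `setattr()` removes `.internal.selfref` by reference and that the table can still be queried afterwards. It does not reproduce the error itself.

```r
library(data.table)

# Hypothetical toy table, used only for illustration
DT.toy <- data.table(x = 1:3)

# On a healthy data.table this attribute is an external pointer
selfref.before <- attr(DT.toy, ".internal.selfref")
print(typeof(selfref.before))

# The workaround: drop the attribute by reference (DT.toy is not copied)
setattr(DT.toy, ".internal.selfref", NULL)
selfref.removed <- is.null(attr(DT.toy, ".internal.selfref"))
print(selfref.removed)

# Read-only queries still work after the attribute is gone
total <- DT.toy[, sum(x)]
print(total)
```

In the foreach() loop, the setattr() call would sit at the top of the loop body, as in the commented-out line in the appendix.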

Any ideas for addressing the root problem?

Many thanks for any help!

Eric

Appendix: Abbreviated R Session Info to confirm latest versions:

R version 2.15.3 (2013-03-01)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
other attached packages:
 [1] data.table_1.8.8  doMC_1.3.0

Example using simulated data -- you may have to run the history() function many times (like, hundreds) to get the error:

##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
## Load packages and Prepare Data
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
require(data.table)
##this is the package we use for multicore
require(doMC)
##register n-2 of your machine's cores (but always at least one)
registerDoMC(max(1, parallel::detectCores() - 2))

## Build simulated data
value.a <- runif(500,0,1)
value.b <- 1-value.a
value <- c(value.a,value.b)
answer.opt <- c(rep("a",500),rep("b",500))
answer.id <- rep( 6000:6499 , 2)
question.id <- rep( sample(c(1001,1010,1041,1121,1124),500,replace=TRUE) ,2)
date <- rep( (Sys.Date() - sample.int(150, size=500, replace=TRUE)) , 2)
user.id <- rep( sample(250:350, size=500, replace=TRUE) ,2)
condition <- substr(as.character(user.id),1,1)
condition[which(condition=="2")] <- "x"
condition[which(condition=="3")] <- "y"

##Put everything in a data.table
DT.full <- data.table(user.id = user.id,
                      answer.opt = answer.opt,
                      question.id = question.id,
                      date = date,
                      answer.id = answer.id,
                      condition = condition,
                      value = value)

##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
## Daily Aggregation Function
##
##a basic function that aggregates all the values from
##all users for every question on a given day:
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
each.day <- function(val.date){
  DT <- DT.full[ date < val.date ]

  #count the number of updates per user (for weighting)
  setkey(DT, question.id, user.id)
  DT <- DT[ DT[answer.opt=="a",length(value),by="question.id,user.id"] ]
  setnames(DT, "V1", "freq")

  #retain only the most recent value from each user on each question
  setkey(DT, question.id, user.id, answer.id)
  DT <- DT[ DT[ ,answer.id == max(answer.id), by="question.id,user.id", ][[3]] ]

  #now get a weighted mean (with freq) of the value for each question
  records <- lapply(unique(DT$question.id), function(q.id) {
    DT <- DT[ question.id == q.id ]
    probs <- DT[ ,weighted.mean(value,freq), by="answer.opt" ]
    return(data.table(q.id = rep(q.id,nrow(probs)),
                      ans.opt = probs$answer.opt,
                      date = rep(val.date,nrow(probs)),
                      value = probs$V1))
  })
  return(do.call("rbind",records))
}

##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
## foreach History Function 
##
##to aggregate across many days quickly
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
history <- function(start, end){
  #define a sequence of dates
  date.seq <- seq(as.Date(start),as.Date(end),by="day")

  #now run a foreach to get the history for each date
  hist <- foreach(day = date.seq,  .combine = "rbind") %dopar% {
    #setattr(DT,".internal.selfref",NULL) #resolves occasional internal selfref error
    each.day(val.date = day)
  }
  #return the combined result explicitly rather than invisibly
  return(hist)
}

##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
## Examples
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

##aggregate only one day
each.day(val.date = "2012-12-13")

##generate a history
hist.example <- history (start = "2012-11-01", end = Sys.Date())
theEricStone asked Mar 11 '13 15:03


1 Answer

Thanks for reporting and all the help in finding it! Now fixed in v1.8.11. From NEWS:

In long running computations where data.table is called many times repetitively, the following error could sometimes occur, #2647:
Internal error: .internal.selfref prot is not itself an extptr
Fixed. Thanks to theEricStone, StevieP and JasonB for (difficult) reproducible examples.
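
To check whether an installation already includes this fix, something along these lines may help. `has.selfref.fix` is a hypothetical helper name, not part of data.table; the sketch relies only on base R's `packageVersion()` and `package_version()`.

```r
# Hypothetical helper (not part of data.table): TRUE if the given version
# is at least 1.8.11, the release that NEWS says contains the fix.
# With no argument it inspects the installed data.table.
has.selfref.fix <- function(installed = utils::packageVersion("data.table")) {
  installed >= package_version("1.8.11")
}

# Explicit versions, so these lines run even without data.table installed
print(has.selfref.fix(package_version("1.8.8")))   # FALSE: version from the question
print(has.selfref.fix(package_version("1.8.11")))  # TRUE: first version with the fix
```

On older versions, the setattr() workaround from the question remains the fallback.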

Possibly related is a memory leak in grouping, which is also now fixed.

Long outstanding (usually small) memory leak in grouping fixed, #2648. When the last group is smaller than the largest group, the difference in those sizes was not being released. Also in non-trivial aggregations where each group returns a different number of rows. Most users run a grouping query once and will never have noticed these, but anyone looping calls to grouping (such as when running in parallel, or benchmarking) may have suffered. Tests added. Thanks to many including vc273 and Y T.
See also: "Memory leak in data.table grouped assignment by reference" and "Slow memory leak in data.table when returning named lists in j (trying to reshape a data.table)".

Matt Dowle answered Oct 25 '22 10:10