I love data.table, it's fast and intuitive, what could be better?
Alas, here's my problem: when referring to a data.table within a foreach() loop (using the doMC implementation) I will occasionally get the following error:
(A reproducible example is in the appendix below.)
Error in { :
Internal error: .internal.selfref prot is not itself an extptr
One of the annoying problems here is that I can't reproduce it consistently, but it does happen during some long (several-hour) tasks, so I want to make sure it never happens, if possible.
Since I refer to the same data.table, DT, in each loop, I tried running the following at the beginning of each loop:
setattr(DT,".internal.selfref",NULL)
...to remove the invalid/corrupted self ref attribute. This works and the internal selfref error no longer occurs. It's a workaround, though.
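As a minimal sketch of what that call does (on a hypothetical toy table, not the data in the appendix), dropping the attribute still leaves an ordinary read-only query working:

```r
library(data.table)

# Hypothetical toy table, for illustration only.
DT <- data.table(id = 1:3, value = c(0.2, 0.5, 0.9))

# A freshly created data.table carries the self-reference attribute.
had_selfref <- !is.null(attr(DT, ".internal.selfref"))

# The workaround: drop the attribute by reference.
setattr(DT, ".internal.selfref", NULL)
dropped <- is.null(attr(DT, ".internal.selfref"))

# A plain subsetting query still runs on the modified table.
res <- DT[value > 0.3]
```

This only illustrates the workaround itself; it says nothing about why the attribute becomes invalid in the first place.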
Any ideas for addressing the root problem?
Many thanks for any help!
Eric
Appendix: Abbreviated R Session Info to confirm latest versions:
R version 2.15.3 (2013-03-01)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
other attached packages:
[1] data.table_1.8.8 doMC_1.3.0
Example using simulated data -- you may have to run the history() function many times (like, hundreds) to get the error:
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
## Load packages and Prepare Data
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
require(data.table)
##this is the package we use for multicore
require(doMC)
##register n-2 of your machine's cores (but at least 1);
##detectCores() is exported by the parallel package
registerDoMC(max(1, parallel::detectCores() - 2))
## Build simulated data
value.a <- runif(500,0,1)
value.b <- 1-value.a
value <- c(value.a,value.b)
answer.opt <- c(rep("a",500),rep("b",500))
answer.id <- rep( 6000:6499 , 2)
question.id <- rep( sample(c(1001,1010,1041,1121,1124),500,replace=TRUE) ,2)
date <- rep( (Sys.Date() - sample.int(150, size=500, replace=TRUE)) , 2)
user.id <- rep( sample(250:350, size=500, replace=TRUE) ,2)
condition <- substr(as.character(user.id),1,1)
condition[which(condition=="2")] <- "x"
condition[which(condition=="3")] <- "y"
##Put everything in a data.table
DT.full <- data.table(user.id = user.id,
                      answer.opt = answer.opt,
                      question.id = question.id,
                      date = date,
                      answer.id = answer.id,
                      condition = condition,
                      value = value)
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
## Daily Aggregation Function
##
##a basic function that aggregates all the values from
##all users for every question on a given day:
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
each.day <- function(val.date){
  DT <- DT.full[ date < val.date ]
  #count the number of updates per user (for weighting)
  setkey(DT, question.id, user.id)
  DT <- DT[ DT[answer.opt=="a",length(value),by="question.id,user.id"] ]
  setnames(DT, "V1", "freq")
  #retain only the most recent value from each user on each question
  setkey(DT, question.id, user.id, answer.id)
  DT <- DT[ DT[ ,answer.id == max(answer.id), by="question.id,user.id", ][[3]] ]
  #now get a weighted mean (with freq) of the value for each question
  records <- lapply(unique(DT$question.id), function(q.id) {
    DT <- DT[ question.id == q.id ]
    probs <- DT[ ,weighted.mean(value,freq), by="answer.opt" ]
    return(data.table(q.id = rep(q.id,nrow(probs)),
                      ans.opt = probs$answer.opt,
                      date = rep(val.date,nrow(probs)),
                      value = probs$V1))
  })
  return(do.call("rbind",records))
}
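To make the intent of each.day() easier to follow, here is a simplified sketch of its three steps (count updates, keep the latest answer, take the weighted mean) on a hypothetical toy table. It assumes only option "a" rows are present, so counting all rows per user stands in for the answer.opt=="a" filter above; the names toy, latest and result are made up for illustration and this is not a drop-in replacement for the function.

```r
library(data.table)

# Hypothetical toy data: two users answering one question, option "a" only.
toy <- data.table(
  question.id = c(1001, 1001, 1001),
  user.id     = c(250, 250, 300),
  answer.id   = c(6000, 6001, 6002),
  answer.opt  = "a",
  value       = c(0.2, 0.4, 0.8)
)

# Step 1: number of updates per user and question, used as the weight.
toy[, freq := .N, by = list(question.id, user.id)]

# Step 2: keep only the most recent answer (largest answer.id)
# for each user on each question.
keep <- toy[, .I[which.max(answer.id)], by = list(question.id, user.id)]$V1
latest <- toy[keep]

# Step 3: frequency-weighted mean of the retained values
# per question and answer option.
result <- latest[, list(value = weighted.mean(value, freq)),
                 by = list(question.id, answer.opt)]
```

Here user 250 contributes value 0.4 with weight 2 and user 300 contributes 0.8 with weight 1, so the weighted mean is (0.4*2 + 0.8*1)/3.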
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
## foreach History Function
##
##to aggregate across many days quickly
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
history <- function(start, end){
  #define a sequence of dates
  date.seq <- seq(as.Date(start),as.Date(end),by="day")
  #now run a foreach to get the history for each date
  hist <- foreach(day = date.seq, .combine = "rbind") %dopar% {
    #setattr(DT,".internal.selfref",NULL) #resolves occasional internal selfref error
    each.day(val.date = day)
  }
}
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
## Examples
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
##aggregate only one day
each.day(val.date = "2012-12-13")
##generate a history
hist.example <- history (start = "2012-11-01", end = Sys.Date())
Thanks for reporting and all the help in finding it! Now fixed in v1.8.11. From NEWS:
In long running computations where data.table is called many times repetitively, the following error could sometimes occur, #2647:
Internal error: .internal.selfref prot is not itself an extptr
Fixed. Thanks to theEricStone, StevieP and JasonB for (difficult) reproducible examples.
Possibly related is a memory leak in grouping, which is also now fixed.
Long outstanding (usually small) memory leak in grouping fixed, #2648. When the last group is smaller than the largest group, the difference in those sizes was not being released. Also in non-trivial aggregations where each group returns a different number of rows. Most users run a grouping query once and will never have noticed these, but anyone looping calls to grouping (such as when running in parallel, or benchmarking) may have suffered. Tests added. Thanks to many including vc273 and Y T.
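To check whether an installed copy already includes these fixes, something like the following should work. The 1.8.11 threshold comes from the NEWS entries above; the variable names are just for illustration.

```r
# Version threshold taken from the NEWS entries quoted above.
fixed_in <- "1.8.11"

# packageVersion() is in base R's utils package; it errors if
# data.table is not installed at all.
installed <- packageVersion("data.table")
has_fix <- installed >= fixed_in

if (!has_fix) {
  message("data.table ", format(installed),
          " predates the fixes in ", fixed_in, "; consider upgrading.")
}
```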
Memory leak in data.table grouped assignment by reference
Slow memory leak in data.table when returning named lists in j (trying to reshape a data.table)