
Is there a memory leak in the data.table package in R 3.6.0?

In R 3.6.0 (pre-release) only, I see a memory leak in the data.table package. It happens with the CRAN version as well as the GitHub version.

require(data.table)
n <- 2e6
df <- data.frame(a=rnorm(n),
                 b=factor(rbinom(n,5,prob=0.5),1:5,letters[1:5]),
                 c=factor(rbinom(n,5,prob=0.5),1:5,letters[1:5]))
dt <- setDT(df)
print(pryr::mem_used())
fff <- function(aref) {
  ff <- lapply(1:5, function(i) {
    dt2 <- dt[,list(sumA=sum(get(aref))),by=b][,c:=letters[i]]
    dt2
  })
  return(rbindlist(ff))
}
for(i in 1:10) {
  f <- fff("a")
  rm("f")
  gc()
  print(pryr::mem_used())
}
gc()
print(pryr::mem_used())

This returns (on 3.6.0 only):

81.2 MB
81.2 MB
81.2 MB
184 MB
287 MB
390 MB
493 MB
596 MB
699 MB
802 MB

Any ideas?

Both the call to `get` and the `by` appear to be necessary. The `[,c:=letters[i]]` is not, but it makes the memory leak appear much faster.
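
To check which pieces are actually needed, each variant can be run through the same measurement loop. Below is a minimal sketch of such a loop using only base R's `gc()`, so it does not depend on pryr. The helper name `mem_growth` and the two toy functions are hypothetical and only illustrate the pattern; they are not part of data.table.

```r
# Hypothetical helper: call f() `reps` times and record how much memory
# (in MB, as reported by gc()) is in use after each call, relative to the start.
mem_growth <- function(f, reps = 10) {
  # Column 2 of gc() is the "used (Mb)" figure for Ncells and Vcells.
  used_mb <- function() sum(gc()[, 2])
  baseline <- used_mb()
  after <- numeric(reps)
  for (i in seq_len(reps)) {
    res <- f()
    rm(res)
    after[i] <- used_mb()
  }
  after - baseline
}

# A function that retains nothing should stay close to zero.
no_leak <- function() sum(rnorm(1e6))

# A function that keeps adding to a long-lived environment should grow steadily.
store <- new.env()
leaky <- function() {
  assign(paste0("x", length(ls(store)) + 1), rnorm(1e6), envir = store)
  invisible(NULL)
}

print(round(mem_growth(no_leak, 5), 1))
print(round(mem_growth(leaky, 5), 1))
```

With the code from the question loaded, `mem_growth(function() fff("a"))` should give the same kind of series as the `pryr::mem_used()` loop, and passing in variants of `fff` (one without `get`, one without `by=b`) would show which of them grow.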

My session info

> sessionInfo()
R Under development (unstable) (2018-05-10 r74708)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.11.3

loaded via a namespace (and not attached):
[1] compiler_3.6.0   pryr_0.1.4       magrittr_1.5     tools_3.6.0     
[5] Rcpp_0.12.16     stringi_1.1.7    codetools_0.2-15 stringr_1.3.0   
asked May 11 '18 19:05 by pdb


1 Answer

Yay! A reproducible example. We've been struggling for a few weeks in this area. Your example looks extremely useful. Please join us on GitHub.

The current milestone (next release) is 1.11.4 and there are several related issues there. What made you think we didn't want you to raise an issue? Bullet point 3 of the issue template, I guess. I've now changed those points to be clearer, I hope. You're a package developer having issues at the moment with the as-yet-unreleased R 3.6.0 and the recently released data.table, so that should be on GitHub.


answered Oct 19 '22 22:10 by Matt Dowle