I have 32GB of RAM on this machine, but I can get R killed faster than anybody ;)
The goal here is to achieve an rbind() of two data.tables using functions that make use of data.table's efficiency.
input:
rm(list=ls())
gc()
output:
          used (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells 1604987 85.8    2403845  128.4   2251281  120.3
Vcells 3019405 23.1  537019062 4097.2 468553954 3574.8
input:
library(data.table)
tmp.table <- data.table(X1=sample(1:7,4096000,replace=TRUE),
                        X2=as.factor(sample(1:2,4096000,replace=TRUE)),
                        X3=sample(1:1000,4096000,replace=TRUE),
                        X4=sample(1:256,4096000,replace=TRUE),
                        X5=sample(1:16,4096000,replace=TRUE),
                        X6=rnorm(4096000))
setkey(tmp.table,X1,X2,X3,X4,X5,X6)
join.table <- data.table(X1 = integer(), X2 = factor(),
                         X3 = integer(), X4=integer(),
                         X5 = integer(), X6 = numeric())
setkey(join.table,X1,X2,X3,X4,X5,X6)
tables()
output:
     NAME            NROW  MB COLS              KEY
[1,] join.table         0   1 X1,X2,X3,X4,X5,X6 X1,X2,X3,X4,X5,X6
[2,] tmp.table  4,096,000 110 X1,X2,X3,X4,X5,X6 X1,X2,X3,X4,X5,X6
Total: 111MB
input:
join.table <- merge(join.table,tmp.table,all.y=TRUE)
output:
Ha! Nope. RStudio restarts the session.
What's going on here? Explicitly setting the factor levels in join.table had no effect. Using rbind() instead of merge() didn't help--exact same behavior. I have done way more complicated and bulky things related to this data without any problems.
$platform
[1] "x86_64-pc-linux-gnu"
$arch
[1] "x86_64"
$os
[1] "linux-gnu"
$system
[1] "x86_64, linux-gnu"
$version.string
[1] "R version 3.0.2 (2013-09-25)"
$nickname
[1] "Frisbee Sailing"
> rstudio::versionInfo()
$version
[1] ‘99.9.9’
$mode
[1] "server"
Data.table is version 1.8.11.
o rbindlist with at least one factor column along with the presence of at least one empty data.table resulted in segfault (or in linux/mac reported an error related to hash tables). This is now fixed, #5355. Thanks to Trevor Alexander for reporting on SO (and mnel for filing the bug report): merging really not that large data.tables immediately results in R being killed
This can be reproduced with a single-row data.table with a factor column and a zero-row data.table with a factor column.
library(data.table)
A <- data.table(x=factor(1), key='x')
B <- data.table(x=factor(), key='x')
merge(B, A, all.y=TRUE)
# Rstudio -> R encountered fatal error
# R Gui -> R for windoze GUI has stopped working
Using debugonce(data.table:::merge.data.table), this can be traced to the line rbind(dt,yy), which is the equivalent of
rbind(B,A)
which, if you run it, will give the same error.
This has been reported to the package authors as issue #5355.
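Until a fixed build is installed, one possible workaround is to keep zero-row tables away from the binding step altogether. The sketch below assumes the crash is only triggered when an empty data.table takes part in the bind, as the reproduction above suggests. rbind_nonempty is a hypothetical helper written for illustration, not part of data.table.

```r
library(data.table)

# Hypothetical helper (not part of data.table): bind two tables while
# dropping zero-row inputs before they reach rbindlist().
rbind_nonempty <- function(a, b) {
  parts <- Filter(function(d) nrow(d) > 0L, list(a, b))
  # Both inputs empty: nothing to bind, return the first one unchanged.
  if (length(parts) == 0L) return(a)
  rbindlist(parts)
}

A <- data.table(x = factor(1), key = "x")
B <- data.table(x = factor(), key = "x")

res <- rbind_nonempty(B, A)
setkey(res, x)  # set the key on the combined result
```

In the original example, rbind_nonempty(join.table, tmp.table) would simply return the rows of tmp.table, because the empty join.table is filtered out first.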