
merging really not that large data.tables immediately results in R being killed [closed]

I have 32GB of ram on this machine, but I can get R killed faster than anybody ;)

example

The goal here is to achieve an rbind() of two data.tables using functions that make use of data.table's efficiency.

input:

rm(list=ls())
gc()

output:

          used (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells 1604987 85.8    2403845  128.4   2251281  120.3
Vcells 3019405 23.1  537019062 4097.2 468553954 3574.8

input:

library(data.table)

tmp.table <- data.table(X1 = sample(1:7, 4096000, replace = TRUE),
                        X2 = as.factor(sample(1:2, 4096000, replace = TRUE)),
                        X3 = sample(1:1000, 4096000, replace = TRUE),
                        X4 = sample(1:256, 4096000, replace = TRUE),
                        X5 = sample(1:16, 4096000, replace = TRUE),
                        X6 = rnorm(4096000))

setkey(tmp.table,X1,X2,X3,X4,X5,X6)

join.table <- data.table(X1 = integer(), X2 = factor(), 
                         X3 = integer(), X4=integer(),
                         X5 = integer(), X6 = numeric())

setkey(join.table,X1,X2,X3,X4,X5,X6)

tables()

output:

     NAME            NROW  MB COLS              KEY              
[1,] join.table         0   1 X1,X2,X3,X4,X5,X6 X1,X2,X3,X4,X5,X6
[2,] tmp.table  4,096,000 110 X1,X2,X3,X4,X5,X6 X1,X2,X3,X4,X5,X6
Total: 111MB

input:

join.table <- merge(join.table,tmp.table,all.y=TRUE)

output:

Ha! Nope. RStudio restarts the session.

question

What's going on here? Explicitly setting the factor levels in join.table had no effect. Using rbind() instead of merge() didn't help either: the behavior was exactly the same. I have done far more complicated and bulky things with this data without any problems.

version info

$platform
[1] "x86_64-pc-linux-gnu"

$arch
[1] "x86_64"

$os
[1] "linux-gnu"

$system
[1] "x86_64, linux-gnu"

$version.string
[1] "R version 3.0.2 (2013-09-25)"

$nickname
[1] "Frisbee Sailing"

> rstudio::versionInfo()
$version
[1] ‘99.9.9’

$mode
[1] "server"

Data.table is version 1.8.11.

bright-star asked Feb 06 '14 00:02


1 Answer

Update: This has been fixed in commit 1123 of v1.8.11. From NEWS:

o rbindlist with at least one factor column along with the presence of at least one empty data.table resulted in segfault (or in linux/mac reported an error related to hash tables). This is now fixed, #5355. Thanks to Trevor Alexander for reporting on SO (and mnel for filing the bug report): merging really not that large data.tables immediately results in R being killed
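To see whether an installation is likely to include this fix, the installed version can be compared against 1.8.11. This is only a rough sketch: it assumes that any version numbered above 1.8.11 contains the fix, which the NEWS entry implies but does not state, and a development build of 1.8.11 itself cannot be classified this way because the fix landed partway through that version.

```r
library(data.table)

# Installed data.table version, as reported by the package metadata.
installed <- packageVersion("data.table")

# ASSUMPTION: versions numbered above 1.8.11 include the fix for #5355.
# A 1.8.11 development build may or may not include it, so it is
# treated as "not confirmed" here.
has_fix <- installed > "1.8.11"

if (has_fix) {
  message("data.table ", installed, " should include the rbindlist fix")
} else {
  message("data.table ", installed, " may still crash; consider updating")
}
```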


This can be reproduced with a single row data.table with a factor column and a zero-row data.table with a factor column.

library(data.table)
A <- data.table(x=factor(1), key='x')
B <- data.table(x=factor(), key='x')
merge(B, A, all.y=TRUE)

# Rstudio -> R encountered fatal error
#  R Gui -> R for windoze GUI has stopped working

Using debugonce(data.table:::merge.data.table), this can be traced to the line rbind(dt,yy), which is the equivalent of

rbind(B,A)

which, if you run it, will give the same error.

This has been reported to the package authors as issue #5355
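Since the NEWS entry ties the crash to the presence of an empty data.table, one possible workaround on an affected version is to drop zero-row tables before binding. The helper below is a minimal sketch, not part of data.table: the name rbind_non_empty is hypothetical, and it assumes all inputs share the same columns.

```r
library(data.table)

# Hypothetical helper (not part of data.table): bind only the tables
# that actually contain rows, so rbindlist never sees an empty input.
rbind_non_empty <- function(...) {
  tables <- list(...)
  stopifnot(length(tables) > 0L)
  non_empty <- Filter(function(dt) nrow(dt) > 0L, tables)
  # If every input is empty, return the first one unchanged.
  if (length(non_empty) == 0L) return(tables[[1L]])
  rbindlist(non_empty)
}

A <- data.table(x = factor(1), key = "x")
B <- data.table(x = factor(), key = "x")

result <- rbind_non_empty(B, A)

# rbindlist does not keep keys, so set the key again if it is needed.
setkeyv(result, "x")
```

For the tables in the question, rbind_non_empty(join.table, tmp.table) would skip the empty join.table and return the rows of tmp.table.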

mnel answered Nov 16 '22 03:11