R - slow performance in creating lots of data.table objects

I am of course aware that the main purpose of a data.table object is to allow fast subsetting/grouping etc., and that it makes much more sense to have one big data.table and subset it (very efficiently) than to have many (possibly small) data.table objects.

That said, I recently wrote a script that instantiates a lot of data.table objects, and I noticed that performance decreases as the number of in-memory data.tables grows.

Here's an example of what I mean:

n <- 10000
# create a list containing 10k data.frame's
system.time(lotsofDFs <- lapply(1:n,FUN=function(i){ data.frame(A=1:10,B=1:10,ID=i)}),gcFirst=T)
#   user  system elapsed 
#   2.24    0.00    2.23 
# create a list containing 10k data.table's
system.time(lotsofDTs <- lapply(1:n,FUN=function(i){ data.table(A=1:10,B=1:10,ID=i)}),gcFirst=T)
#   user  system elapsed 
#   5.49    0.01    5.53 
n <- 80000
# create a list containing 80k data.frame's
system.time(lotsofDFs <- lapply(1:n,FUN=function(i){ data.frame(A=1:10,B=1:10,ID=i)}),gcFirst=T)
#   user   system elapsed
#   19.42    0.01   19.53
# create a list containing 80k data.table's
system.time(lotsofDTs <- lapply(1:n,FUN=function(i){ data.table(A=1:10,B=1:10,ID=i)}),gcFirst=T)
#   user    system elapsed
#   147.03    0.10  147.41

As you can see, while data.frame creation time grows linearly with the number of data.frames created, data.table creation time seems to grow more than linearly.

Is this expected?

Does this have something to do with the list of in-memory tables (the one you can see by calling the tables() function)?


Environment:

R version 3.1.2 (on Windows)
data.table 1.9.4


EDIT:

As pointed out by @Arun in the comments, as.data.table(...) seems to behave similarly to data.frame(...). In fact, paradoxically, as.data.table(data.frame(...)) is faster than data.table(...), and its time grows linearly with the number of objects, e.g.:

n <- 10000
# create a list containing 10k data.table's using as.data.table
system.time(lotsofDTs <- lapply(1:n,FUN=function(i){ as.data.table(data.frame(A=1:10,B=1:10,ID=i))}),gcFirst=T)
#   user  system elapsed 
#   5.04    0.01    5.04 
n <- 80000
# create a list containing 80k data.table's using as.data.table
system.time(lotsofDTs <- lapply(1:n,FUN=function(i){ as.data.table(data.frame(A=1:10,B=1:10,ID=i))}),gcFirst=T)
#   user   system elapsed
#   44.94    0.12   45.28
asked Jan 28 '15 by digEmAll

1 Answer

You should use setDT, which converts a list to a data.table by reference and avoids the overhead of the data.table() constructor:

n <- 80000
system.time(lotsofDTs <- lapply(1:n,FUN=function(i){setDT(list(A=1:10,B=1:10,ID=matrix(i,10)))}),gcFirst=T)
#   user  system elapsed 
#   6.75    0.28    7.17

system.time(lotsofDFs <- lapply(1:n,FUN=function(i){ data.frame(A=1:10,B=1:10,ID=i)}),gcFirst=T)
#   user  system elapsed 
#  32.58    1.40   34.22 
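A note on the setDT call above: unlike data.frame(), setDT does not (to my knowledge) recycle length-1 elements, so every list element must already have the full row count. That is presumably why the answer writes ID=matrix(i,10) rather than ID=i; rep(i, 10) should work just as well. A minimal sketch of the same idea:

```r
library(data.table)

# setDT() turns the list into a data.table in place (no copy).
# All elements must have equal length, so the scalar ID value
# is expanded to 10 rows explicitly with rep().
dt <- setDT(list(A = 1:10, B = 1:10, ID = rep(7L, 10)))

class(dt)  # "data.table" "data.frame"
nrow(dt)   # 10
```

The by-reference conversion is what makes setDT cheap here: it only sets attributes on the existing list instead of copying and validating columns the way data.table() does.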
answered Oct 21 '22 by filius_arator