I am of course aware that the main purpose of a data.table object is to allow fast subsetting, grouping, etc., and that it makes much more sense to have one big data.table and subset it (very efficiently) than to have a lot of (possibly small) data.table objects.
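(For instance, with one big keyed table, retrieving the rows for any given ID is fast; a minimal sketch with illustrative sizes:)

library(data.table)
# one big table instead of 10k small ones; key it for binary-search subsetting
big <- data.table(A = 1:10, B = 1:10, ID = rep(1:10000, each = 10))
setkey(big, ID)
big[J(42)]  # efficiently fetch the rows that would have been the 42nd small table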
That being said, I recently wrote a script that instantiates a lot of data.table objects, and I noticed that performance decreases as the number of in-memory data.tables grows.
Here's an example of what I mean:
library(data.table)

n <- 10000
# create a list containing 10k data.frame's
system.time(lotsofDFs <- lapply(1:n, FUN = function(i) data.frame(A = 1:10, B = 1:10, ID = i)), gcFirst = TRUE)
#  user  system elapsed
#  2.24    0.00    2.23
# create a list containing 10k data.table's
system.time(lotsofDTs <- lapply(1:n, FUN = function(i) data.table(A = 1:10, B = 1:10, ID = i)), gcFirst = TRUE)
#  user  system elapsed
#  5.49    0.01    5.53

n <- 80000
# create a list containing 80k data.frame's
system.time(lotsofDFs <- lapply(1:n, FUN = function(i) data.frame(A = 1:10, B = 1:10, ID = i)), gcFirst = TRUE)
#  user  system elapsed
# 19.42    0.01   19.53
# create a list containing 80k data.table's
system.time(lotsofDTs <- lapply(1:n, FUN = function(i) data.table(A = 1:10, B = 1:10, ID = i)), gcFirst = TRUE)
#   user  system elapsed
# 147.03    0.10  147.41
As you can see, while data.frame creation time grows linearly with the number of data.frame's created, data.table creation time appears to grow more than linearly.
Is this expected? Does it have something to do with the internal list of in-memory tables (the one you can see by calling the tables() function)?
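One way to probe that hypothesis (a minimal sketch, not timed on my machine) is to time a fixed batch of data.table() calls while a growing number of data.tables is already held in memory; if the per-call cost rises with the pool size, overall creation time is indeed superlinear:

# sketch: time 1,000 data.table() calls with k tables already in memory
probe <- function(k) {
  pool <- lapply(seq_len(k), function(i) data.table(A = 1:10))  # keep k tables alive
  system.time(for (j in 1:1000) data.table(A = 1:10, B = 1:10, ID = j))[["elapsed"]]
}
sapply(c(0, 10000, 20000), probe)
# if elapsed time per batch increases with k, per-object creation cost
# depends on how many data.tables already exist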
Environment:
R version 3.1.2 (on Windows)
data.table 1.9.4
EDIT:
As pointed out by @Arun in the comments, as.data.table(...) seems to behave like data.frame(...). In fact, somewhat paradoxically, as.data.table(data.frame(...)) is faster than data.table(...), and its time grows linearly with the number of objects, e.g.:
n <- 10000
# create a list containing 10k data.table's using as.data.table
system.time(lotsofDTs <- lapply(1:n, FUN = function(i) as.data.table(data.frame(A = 1:10, B = 1:10, ID = i))), gcFirst = TRUE)
#  user  system elapsed
#  5.04    0.01    5.04

n <- 80000
# create a list containing 80k data.table's using as.data.table
system.time(lotsofDTs <- lapply(1:n, FUN = function(i) as.data.table(data.frame(A = 1:10, B = 1:10, ID = i))), gcFirst = TRUE)
#  user  system elapsed
# 44.94    0.12   45.28
You should use setDT, which converts a list to a data.table by reference and so skips the copying and checking overhead of the data.table() constructor:
n <- 80000
# create 80k data.table's via setDT; note that ID must already have length 10,
# since setDT() does not recycle length-1 inputs the way data.table() does
system.time(lotsofDTs <- lapply(1:n, FUN = function(i) setDT(list(A = 1:10, B = 1:10, ID = matrix(i, 10)))), gcFirst = TRUE)
#  user  system elapsed
#  6.75    0.28    7.17
# the same 80k data.frame's, for comparison
system.time(lotsofDFs <- lapply(1:n, FUN = function(i) data.frame(A = 1:10, B = 1:10, ID = i)), gcFirst = TRUE)
#  user  system elapsed
# 32.58    1.40   34.22
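And if, as the question's opening paragraph suggests, the small tables are ultimately better combined, rbindlist stacks the whole list into one big data.table (a sketch):

big <- rbindlist(lotsofDFs)  # rbindlist also accepts a list of data.frame's
big[ID == 42]                # fast subsetting on the single combined table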