I am confused about why accessing a data.table by row index is slower than a data.frame. Any suggestions for how I can access each row of a data.table sequentially in a loop, faster?
m = matrix(1L, nrow=100000, ncol=100)
DF = as.data.frame(m)
DT = as.data.table(m)
> identical(DF[100, ], DT[100, ])
[1] FALSE
> all(DF[100, ] == DT[100, ])
[1] TRUE
> system.time(for (i in 1:1000) DT[i, ])
   user  system elapsed
  5.440   0.000   5.451
> system.time(for (i in 1:1000) DF[i, ])
   user  system elapsed
  2.757   0.000   2.784
A data.table query has more arguments (and it does more), so the small per-call overhead of DT[...] is larger than that of DF[...]. That overhead adds up if you put it in a loop. The intended use of data.table is to execute a large, complex operation a few times, rather than small trivial calculations many times. So let's reformulate your test:
> system.time(DT[seq_len(nrow(m)), ])
   user  system elapsed
   0.08    0.02    0.09
> system.time(DF[seq_len(nrow(m)), ])
   user  system elapsed
   0.08    0.05    0.13
Here they are about the same: there is only one DT[...] call, so the overhead is paid only once. In your loop you paid it 1,000 times (unnecessarily, I might add). If you are using data.table and making thousands of calls to it, you are probably using it wrong. There is almost certainly a way to reformulate so that one or a few data.table calls do the same thing.
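For instance, here is a minimal sketch assuming the loop was computing some per-row statistic such as a row sum (the totals name is hypothetical):

# slow: pays the [.data.table overhead once per row
totals = numeric(nrow(DT))
for (i in seq_len(nrow(DT))) totals[i] = sum(DT[i, ])

# fast: one data.table call handles every row at once
totals = DT[, rowSums(.SD)]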
Also note that even my reformulated test here is pretty trivial, which is why data.table performs only comparably to data.frame.
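As an aside: if you genuinely must visit rows one at a time and, as here, every column has the same type, a plain matrix sidesteps the per-call dispatch overhead entirely. A minimal sketch, reusing the m defined above:

# matrix row indexing is a primitive operation with no method dispatch,
# so this loop should run much faster than either DT[i, ] or DF[i, ]
system.time(for (i in 1:1000) m[i, ])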