I just spent some time researching data.table in R and was wondering about the conditions under which I can expect the largest performance gains. Maybe the simple answer is when I have a large data.frame and often operate on subsets of it. When I just load data files and estimate models I can't expect much, but many [ operations (i.e., subsetting) make the difference. Is that true, and is that the only answer, or what else should I consider? When does it start to matter? 10x5, 1,000x5, 1,000,000x5?
Edit: Some of the comments suggest that data.table is often faster and, equally important, almost never slower. So it would also be good to know when not to use data.table.
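One way to get a rough feel for the size question is a quick microbenchmark. The sketch below is not from the original post; the row counts, the single grouping column, and the repeated keyed subset are illustrative assumptions only, and actual timings will depend on your machine and R version.

library(data.table)

benchmark_size <- function(n) {
  # Made-up data: one character id column and one numeric column
  df <- data.frame(id = sample(LETTERS, n, replace = TRUE),
                   x  = rnorm(n),
                   stringsAsFactors = FALSE)
  dt <- as.data.table(df)
  setkey(dt, id)  # keyed subsetting uses binary search instead of a full vector scan

  c(
    data.frame = system.time(for (i in 1:10) df[df$id == "A", ])[["elapsed"]],
    data.table = system.time(for (i in 1:10) dt["A"])[["elapsed"]]
  )
}

# Sizes loosely echo the 1,000x5 / 1,000,000x5 idea from the question
sapply(c(1e3, 1e5, 1e7), benchmark_size)

At small sizes the two are usually indistinguishable; the gap tends to show up once the table is large and the same subset/group operation is repeated many times.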
There are at least a few cases where data.table shines:

data.table works by reference, which is massively faster. Here are some benchmarks: Benchmarking data.frame (base), data.frame (package dataframe) and data.table.
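A minimal sketch of the by-reference point, with made-up column names and row count: a data.frame follows copy-on-modify semantics when you add or change a column, while data.table's := updates the column in place. The size of the gap you see will depend on the operation and your R version.

library(data.table)

n  <- 1e7
df <- data.frame(x = rnorm(n), y = rnorm(n))
dt <- as.data.table(df)

system.time(df$z <- df$x + df$y)   # data.frame: copy-on-modify when adding the column
system.time(dt[, z := x + y])      # data.table: adds the column by reference, no copy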
One instance where data.table is very fast is in the split-apply-combine type of work which made plyr famous. Say you have a data.frame with the following data:
precipitation  time  station_id
23.3           1     A01
24.1           2     A01
26.1           1     A02
etc            etc   etc
When you need to average per station id, you can use a host of R functions, e.g. ave, ddply, or data.table. If the number of unique elements in station_id grows, data.table scales really well, while e.g. ddply gets really slow. More details, including an example, can be found in this post on my blog. That test suggests that speed increases of more than 150-fold are possible, and the difference can probably be much bigger.