When is it worth using `data.table`? When can I expect the largest performance gains? [closed]




I just spent some time researching data.table in R and was wondering under which conditions I can expect the largest performance gains. Maybe the simple answer is: when I have a large data.frame and often operate on subsets of it. When I just load data files and estimate models I can't expect much, but many `[` operations make the difference. Is that true, and is it the only answer, or what else should I consider? At what size does it start to matter: 10x5, 1,000x5, 1,000,000x5?

Edit: Some of the comments suggest that data.table is often faster and, equally important, almost never slower. So it would also be good to know when not to use data.table.

2 Answers

There are at least a few cases where data.table shines:

  • Updating an existing dataset with new results. Because data.table updates by reference (without copying the whole table), this is massively faster.
  • Split-apply-combine type strategies with large numbers of groups to split over (as @PaulHiemstra's answer points out).
  • Doing almost anything to a truly large dataset.
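
To make the by-reference point concrete, here is a minimal sketch. The table and its column names (`id`, `value`, `doubled`) are made up for illustration and do not come from any benchmark; the only thing it relies on is data.table's `:=` operator.

```r
library(data.table)

# Hypothetical toy data; the column names are invented for this example.
dt <- data.table(id = 1:5, value = c(10, 20, 30, 40, 50))

# `:=` adds a column in place instead of building a modified copy of the table.
dt[, doubled := value * 2]

# The same operator can update just a subset of rows, again in place.
dt[id > 3, value := 0]

print(dt)
```

With a plain data.frame, the equivalent `df$doubled <- df$value * 2` may copy data along the way, which is where the difference tends to show up once the table gets large.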

Here are some benchmarks: Benchmarking data.frame (base), data.frame (package dataframe) and data.table

One instance where data.table is very fast is the split-apply-combine type of work that made plyr famous. Say you have a data.frame with the following data:

precipitation     time   station_id
23.3              1      A01
24.1              2      A01
26.1              1      A02
etc etc

When you need to average per station id, you can use a host of R functions, e.g. ave, ddply, or data.table. As the number of unique elements in station_id grows, data.table scales really well, whilst e.g. ddply gets really slow. More details, including an example, can be found in this post on my blog. This test suggests that speed increases of more than 150-fold are possible. This difference can probably be much bigger...
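
As a rough sketch of what the per-station average looks like in data.table (this is not the code from the blog post): the data below is randomly generated to mirror the table above, and the number of stations and the names `obs`, `station_means` and `base_means` are my own assumptions.

```r
library(data.table)

# Hypothetical data shaped like the table above; values are random.
set.seed(1)
n_stations <- 1000
obs <- data.table(
  precipitation = runif(n_stations * 10, min = 0, max = 50),
  time          = rep(1:10, times = n_stations),
  station_id    = rep(sprintf("A%04d", 1:n_stations), each = 10)
)

# Split-apply-combine in one call: mean precipitation per station,
# returning one row per unique station_id.
station_means <- obs[, .(mean_precip = mean(precipitation)), by = station_id]

# Base R equivalent, kept only to check that both give the same answer.
base_means <- tapply(obs$precipitation, obs$station_id, mean)
```

To see how the timings behave on your own machine, wrap each call in `system.time()` and increase `n_stations`; how big the gap gets will depend on the data and the package versions.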

