I know that data.table vs dplyr comparisons are a perennial favourite on SO. (Full disclosure: I like and use both packages.)
However, in trying to provide some comparisons for a class that I'm teaching, I ran into something surprising w.r.t. memory usage. My expectation was that dplyr would perform especially poorly with operations that require (implicit) filtering or slicing of data. But that's not what I'm finding. Compare:
First dplyr.
library(bench)
library(dplyr, warn.conflicts = FALSE)
library(data.table, warn.conflicts = FALSE)
set.seed(123)
DF = tibble(x = rep(1:10, times = 1e5),
            y = sample(LETTERS[1:10], 10e5, replace = TRUE),
            z = rnorm(1e6))
DF %>% filter(x > 7) %>% group_by(y) %>% summarise(mean(z))
#> # A tibble: 10 x 2
#> y `mean(z)`
#> * <chr> <dbl>
#> 1 A -0.00336
#> 2 B -0.00702
#> 3 C 0.00291
#> 4 D -0.00430
#> 5 E -0.00705
#> 6 F -0.00568
#> 7 G -0.00344
#> 8 H 0.000553
#> 9 I -0.00168
#> 10 J 0.00661
bench::bench_process_memory()
#> current max
#> 585MB 611MB
Created on 2020-04-22 by the reprex package (v0.3.0)
Then data.table.
library(bench)
library(dplyr, warn.conflicts = FALSE)
library(data.table, warn.conflicts = FALSE)
set.seed(123)
DT = data.table(x = rep(1:10, times = 1e5),
                y = sample(LETTERS[1:10], 10e5, replace = TRUE),
                z = rnorm(1e6))
DT[x > 7, mean(z), by = y]
#> y V1
#> 1: F -0.0056834238
#> 2: I -0.0016755202
#> 3: J 0.0066061660
#> 4: G -0.0034436348
#> 5: B -0.0070242788
#> 6: E -0.0070462070
#> 7: H 0.0005525803
#> 8: D -0.0043024627
#> 9: A -0.0033609302
#> 10: C 0.0029146372
bench::bench_process_memory()
#> current max
#> 948.47MB 1.17GB
Created on 2020-04-22 by the reprex package (v0.3.0)
So, basically data.table appears to be using nearly twice the memory that dplyr does for this simple filtering+grouping operation. Note that I'm essentially replicating a use-case that @Arun suggested here would be much more memory efficient on the data.table side. (data.table is still a lot faster, though.)
Any ideas, or am I just missing something obvious?
P.S. As an aside, comparing memory usage ends up being more complicated than it first seems, because R's standard memory-profiling tools (Rprofmem and co.) only see allocations made through R's own allocator and miss memory allocated directly by compiled code (e.g. on the C++ side). Luckily, the bench package now provides a bench_process_memory()
function that also tracks memory outside of R's GC heap, which is why I use it here.
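To make the distinction concrete, here is a minimal sketch contrasting the two approaches (the output file name and the res variable are just illustrative):
## Rprofmem() only logs allocations routed through R's own allocator, so
## memory that data.table/dplyr allocate directly in compiled code never
## shows up in its log.
Rprofmem("allocs.out")
res <- DT[x > 7, mean(z), by = y]  # most of the work (and allocation) happens in C
Rprofmem(NULL)                     # stop logging
## bench_process_memory(), by contrast, asks the OS for the whole process
## footprint (current and max), so it also captures those allocations.
bench::bench_process_memory()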
sessionInfo()
#> R version 3.6.3 (2020-02-29)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Arch Linux
#>
#> Matrix products: default
#> BLAS/LAPACK: /usr/lib/libopenblas_haswellp-r0.3.9.so
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] data.table_1.12.8 dplyr_0.8.99.9002 bench_1.1.1.9000
#>
#> loaded via a namespace (and not attached):
#> [1] Rcpp_1.0.4.6 knitr_1.28 magrittr_1.5 tidyselect_1.0.0
#> [5] R6_2.4.1 rlang_0.4.5.9000 stringr_1.4.0 highr_0.8
#> [9] tools_3.6.3 xfun_0.13 htmltools_0.4.0 ellipsis_0.3.0
#> [13] yaml_2.2.1 digest_0.6.25 tibble_3.0.1 lifecycle_0.2.0
#> [17] crayon_1.3.4 purrr_0.3.4 vctrs_0.2.99.9011 glue_1.4.0
#> [21] evaluate_0.14 rmarkdown_2.1 stringi_1.4.6 compiler_3.6.3
#> [25] pillar_1.4.3 generics_0.0.2 pkgconfig_2.0.3
Created on 2020-04-22 by the reprex package (v0.3.0)
UPDATE: Following @jangorecki's suggestion, I redid the analysis using the cgmemtime shell utility. The numbers are far closer, even with multithreading enabled, and data.table now edges out dplyr w.r.t. high-water RSS+CACHE memory usage.
dplyr
$ ./cgmemtime Rscript ~/mem-comp-dplyr.R
Child user: 0.526 s
Child sys : 0.033 s
Child wall: 0.455 s
Child high-water RSS : 128952 KiB
Recursive and acc. high-water RSS+CACHE : 118516 KiB
data.table
$ ./cgmemtime Rscript ~/mem-comp-dt.R
Child user: 0.510 s
Child sys : 0.056 s
Child wall: 0.464 s
Child high-water RSS : 129032 KiB
Recursive and acc. high-water RSS+CACHE : 118320 KiB
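For reference, the two scripts are just the corresponding chunks from above saved as standalone files; roughly the following (a sketch from memory, with data.table left at its default thread count):
## ~/mem-comp-dt.R (sketch; ~/mem-comp-dplyr.R is the analogous dplyr version)
library(data.table, warn.conflicts = FALSE)
set.seed(123)
DT = data.table(x = rep(1:10, times = 1e5),
                y = sample(LETTERS[1:10], 10e5, replace = TRUE),
                z = rnorm(1e6))
DT[x > 7, mean(z), by = y]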
Bottom line: Accurately measuring memory usage from within R is complicated.
I'll leave my original answer below because I think it still has value.
ORIGINAL ANSWER:
Okay, so in the process of writing this out I realised that data.table's default multi-threading behaviour appears to be the major culprit. If I re-run the data.table chunk, but this time turn off multi-threading, the two results are much more comparable:
library(bench)
library(dplyr, warn.conflicts = FALSE)
library(data.table, warn.conflicts = FALSE)
set.seed(123)
setDTthreads(1) ## TURN OFF MULTITHREADING
DT = data.table(x = rep(1:10, times = 1e5),
                y = sample(LETTERS[1:10], 10e5, replace = TRUE),
                z = rnorm(1e6))
DT[x > 7, mean(z), by = y]
#> y V1
#> 1: F -0.0056834238
#> 2: I -0.0016755202
#> 3: J 0.0066061660
#> 4: G -0.0034436348
#> 5: B -0.0070242788
#> 6: E -0.0070462070
#> 7: H 0.0005525803
#> 8: D -0.0043024627
#> 9: A -0.0033609302
#> 10: C 0.0029146372
bench::bench_process_memory()
#> current max
#> 589MB 612MB
Created on 2020-04-22 by the reprex package (v0.3.0)
Still, I'm surprised that they're this close. The data.table memory performance actually gets comparatively worse if I try a larger data set, despite using a single thread, which makes me suspicious that I'm still not measuring memory usage correctly...
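For anyone who wants to poke at that scaling behaviour themselves, this is the kind of thing I mean (a sketch at 10x the rows; I'm not showing output because the numbers will depend on your machine):
library(bench)
library(data.table, warn.conflicts = FALSE)
set.seed(123)
setDTthreads(1)  ## still single-threaded
DT_big = data.table(x = rep(1:10, times = 1e6),
                    y = sample(LETTERS[1:10], 1e7, replace = TRUE),
                    z = rnorm(1e7))
invisible(DT_big[x > 7, mean(z), by = y])
bench::bench_process_memory()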