I know that data.table vs dplyr comparisons are a perennial favourite on SO. (Full disclosure: I like and use both packages.)
However, in trying to provide some comparisons for a class that I'm teaching, I ran into something surprising w.r.t. memory usage. My expectation was that dplyr would perform especially poorly with operations that require (implicit) filtering or slicing of data. But that's not what I'm finding. Compare:
First dplyr.
library(bench)
library(dplyr, warn.conflicts = FALSE)
library(data.table, warn.conflicts = FALSE)
set.seed(123)
DF = tibble(x = rep(1:10, times = 1e5),
            y = sample(LETTERS[1:10], 10e5, replace = TRUE),
            z = rnorm(1e6))
DF %>% filter(x > 7) %>% group_by(y) %>% summarise(mean(z))
#> # A tibble: 10 x 2
#> y `mean(z)`
#> * <chr> <dbl>
#> 1 A -0.00336
#> 2 B -0.00702
#> 3 C 0.00291
#> 4 D -0.00430
#> 5 E -0.00705
#> 6 F -0.00568
#> 7 G -0.00344
#> 8 H 0.000553
#> 9 I -0.00168
#> 10 J 0.00661
bench::bench_process_memory()
#> current max
#> 585MB 611MB
Created on 2020-04-22 by the reprex package (v0.3.0)
Then data.table.
library(bench)
library(dplyr, warn.conflicts = FALSE)
library(data.table, warn.conflicts = FALSE)
set.seed(123)
DT = data.table(x = rep(1:10, times = 1e5),
                y = sample(LETTERS[1:10], 10e5, replace = TRUE),
                z = rnorm(1e6))
DT[x > 7, mean(z), by = y]
#> y V1
#> 1: F -0.0056834238
#> 2: I -0.0016755202
#> 3: J 0.0066061660
#> 4: G -0.0034436348
#> 5: B -0.0070242788
#> 6: E -0.0070462070
#> 7: H 0.0005525803
#> 8: D -0.0043024627
#> 9: A -0.0033609302
#> 10: C 0.0029146372
bench::bench_process_memory()
#> current max
#> 948.47MB 1.17GB
Created on 2020-04-22 by the reprex package (v0.3.0)
So, basically data.table appears to be using nearly twice the memory that dplyr does for this simple filtering+grouping operation. Note that I'm essentially replicating a use-case that @Arun suggested here would be much more memory efficient on the data.table side. (data.table is still a lot faster, though.)
Any ideas, or am I just missing something obvious?
P.S. As an aside, comparing memory usage ends up being more complicated than it first seems, because R's standard memory-profiling tools (Rprofmem and co.) only see allocations made through R's own allocator and miss memory allocated directly by compiled code (e.g. on the C++ side). Luckily, the bench package now provides a bench_process_memory()
function that also tracks memory outside of R's GC heap, which is why I use it here.
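To make the distinction concrete, here is a minimal sketch contrasting the two approaches (the output file name and the res variable are just illustrative):
## Rprofmem() only logs allocations routed through R's own allocator, so
## memory that data.table/dplyr allocate directly in compiled code never
## shows up in its log.
Rprofmem("allocs.out")
res <- DT[x > 7, mean(z), by = y]  # most of the work (and allocation) happens in C
Rprofmem(NULL)                     # stop logging
## bench_process_memory(), by contrast, asks the OS for the whole process
## footprint (current and max), so it also captures those allocations.
bench::bench_process_memory()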
sessionInfo()
#> R version 3.6.3 (2020-02-29)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Arch Linux
#>
#> Matrix products: default
#> BLAS/LAPACK: /usr/lib/libopenblas_haswellp-r0.3.9.so
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] data.table_1.12.8 dplyr_0.8.99.9002 bench_1.1.1.9000
#>
#> loaded via a namespace (and not attached):
#> [1] Rcpp_1.0.4.6 knitr_1.28 magrittr_1.5 tidyselect_1.0.0
#> [5] R6_2.4.1 rlang_0.4.5.9000 stringr_1.4.0 highr_0.8
#> [9] tools_3.6.3 xfun_0.13 htmltools_0.4.0 ellipsis_0.3.0
#> [13] yaml_2.2.1 digest_0.6.25 tibble_3.0.1 lifecycle_0.2.0
#> [17] crayon_1.3.4 purrr_0.3.4 vctrs_0.2.99.9011 glue_1.4.0
#> [21] evaluate_0.14 rmarkdown_2.1 stringi_1.4.6 compiler_3.6.3
#> [25] pillar_1.4.3 generics_0.0.2 pkgconfig_2.0.3
Created on 2020-04-22 by the reprex package (v0.3.0)
UPDATE: Following @jangorecki's suggestion, I redid the analysis using the cgmemtime shell utility. The numbers are far closer, even with multithreading enabled, and data.table now edges out dplyr w.r.t. high-water RSS+CACHE memory usage.
dplyr
$ ./cgmemtime Rscript ~/mem-comp-dplyr.R
Child user: 0.526 s
Child sys : 0.033 s
Child wall: 0.455 s
Child high-water RSS : 128952 KiB
Recursive and acc. high-water RSS+CACHE : 118516 KiB
data.table
$ ./cgmemtime Rscript ~/mem-comp-dt.R
Child user: 0.510 s
Child sys : 0.056 s
Child wall: 0.464 s
Child high-water RSS : 129032 KiB
Recursive and acc. high-water RSS+CACHE : 118320 KiB
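For reference, the two scripts are just the corresponding chunks from above saved as standalone files; roughly the following (a sketch from memory, with data.table left at its default thread count):
## ~/mem-comp-dt.R (sketch; ~/mem-comp-dplyr.R is the analogous dplyr version)
library(data.table, warn.conflicts = FALSE)
set.seed(123)
DT = data.table(x = rep(1:10, times = 1e5),
                y = sample(LETTERS[1:10], 10e5, replace = TRUE),
                z = rnorm(1e6))
DT[x > 7, mean(z), by = y]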
Bottom line: Accurately measuring memory usage from within R is complicated.
I'll leave my original answer below because I think it still has value.
ORIGINAL ANSWER:
Okay, so in the process of writing this out I realised that data.table's default multi-threading behaviour appears to be the major culprit. If I re-run the data.table chunk, but this time turn off multi-threading, the two results are much more comparable:
library(bench)
library(dplyr, warn.conflicts = FALSE)
library(data.table, warn.conflicts = FALSE)
set.seed(123)
setDTthreads(1) ## TURN OFF MULTITHREADING
DT = data.table(x = rep(1:10, times = 1e5),
                y = sample(LETTERS[1:10], 10e5, replace = TRUE),
                z = rnorm(1e6))
DT[x > 7, mean(z), by = y]
#> y V1
#> 1: F -0.0056834238
#> 2: I -0.0016755202
#> 3: J 0.0066061660
#> 4: G -0.0034436348
#> 5: B -0.0070242788
#> 6: E -0.0070462070
#> 7: H 0.0005525803
#> 8: D -0.0043024627
#> 9: A -0.0033609302
#> 10: C 0.0029146372
bench::bench_process_memory()
#> current max
#> 589MB 612MB
Created on 2020-04-22 by the reprex package (v0.3.0)
Still, I'm surprised that they're this close. The data.table memory performance actually gets comparatively worse if I try a larger data set, despite using a single thread, which makes me suspicious that I'm still not measuring memory usage correctly...
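For anyone who wants to poke at that scaling behaviour themselves, this is the kind of thing I mean (a sketch at 10x the rows; I'm not showing output because the numbers will depend on your machine):
library(bench)
library(data.table, warn.conflicts = FALSE)
set.seed(123)
setDTthreads(1)  ## still single-threaded
DT_big = data.table(x = rep(1:10, times = 1e6),
                    y = sample(LETTERS[1:10], 1e7, replace = TRUE),
                    z = rnorm(1e7))
invisible(DT_big[x > 7, mean(z), by = y])
bench::bench_process_memory()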