What is the correct way to profile memory in R code that contains calls to data.table functions? Let's say I want to determine the maximum memory usage during an expression.

This reference indicates that Rprofmem may not be the right choice:
https://cran.r-project.org/web/packages/profmem/vignettes/profmem.html
All memory allocations that are done via the native allocVector3() part of R's native API are logged, which means that nearly all memory allocations are logged. Any objects allocated this way are automatically deallocated by R's garbage collector at some point. Garbage collection events are not logged by profmem(). Allocations not logged are those done by non-R native libraries or R packages that use native code Calloc() / Free() for internal objects. Such objects are not handled by the R garbage collector.
The data.table source code contains plenty of calls to Calloc() and malloc(), so this suggests that Rprofmem will not measure all memory allocated by data.table functions. But if Rprofmem is not the right tool, how come Matthew Dowle uses it here: "R: loop over columns in data.table"?
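For concreteness, the pattern I have in mind is something like the following sketch (note that Rprofmem only records anything if R was compiled with --enable-memory-profiling):

    Rprofmem("Rprofmem.out", threshold = 10240)   # log allocations of >= 10 KB
    x <- data.table::data.table(a = runif(1e6))   # allocations via R's API are logged
    Rprofmem(NULL)                                # stop logging
    readLines("Rprofmem.out", n = 5)              # inspect the first few entries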
I've found a reference suggesting similar potential issues for gc() (which can be used to measure maximum memory usage between two calls to gc()):
https://r.789695.n4.nabble.com/Determining-the-maximum-memory-usage-of-a-function-td4669977.html
gc() is a good start. Call gc(reset = TRUE) before and gc() after your task, and you will see the maximum extra memory used by R in the interim. (This does not include memory malloced by compiled code, which is much harder to measure as it gets re-used.)
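Following that advice, the bracketing pattern would look roughly like this sketch, where the "max used" column of gc()'s output is the statistic of interest:

    gc(reset = TRUE)                    # reset the "max used" statistics
    x <- data.table::CJ(1:1e4, 1:1e3)   # the task to measure
    gc()                                # "max used" (Mb) now shows the peak since
                                        # the reset, for R-managed memory only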
Nothing I've found suggests that similar issues exist with Rprof(memory.profiling=TRUE). Does this mean that the Rprof approach will work for data.table even though it doesn't always use the R API to allocate memory?
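The usage I have in mind is roughly this sketch (Rprof's memory profiling likewise requires R to be compiled with --enable-memory-profiling):

    Rprof("Rprof.out", memory.profiling = TRUE)
    x <- data.table::CJ(1:1e4, 1:1e3)
    Rprof(NULL)                                  # stop profiling
    summaryRprof("Rprof.out", memory = "both")   # memory use alongside timings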
If Rprof(memory.profiling=TRUE) in fact is not the right tool for the job, what is? Would ssh.utils::mem.usage work?
This is not related to data.table. Recently there was a discussion on Twitter about the same dplyr behaviour: https://mobile.twitter.com/healthandstats/status/1182840075001819136

Measuring at the operating-system level sidesteps the whole problem, because the kernel sees every allocation regardless of whether it went through R's API or through malloc()/Calloc(). On Linux, GNU time reports the peak resident set size of the whole process:

    /usr/bin/time -v Rscript -e 'library(data.table); CJ(1:1e4, 1:1e4)' |& grep resident
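If you want the same number from inside R, a hedged sketch for Linux is to read the kernel's peak-RSS high-water mark from /proc. Note that VmHWM is a high-water mark for the whole process, so if the expression stays below an earlier peak the difference will be zero:

    peak_rss_kb <- function() {
      # VmHWM in /proc/self/status is the process's peak resident set size,
      # so it includes malloc()/Calloc() allocations that R-level tools miss
      line <- grep("^VmHWM:", readLines("/proc/self/status"), value = TRUE)
      as.numeric(gsub("[^0-9]", "", line))   # value is in kilobytes
    }
    library(data.table)
    before <- peak_rss_kb()
    x <- CJ(1:1e4, 1:1e3)
    cat("peak RSS grew by", (peak_rss_kb() - before) / 1024, "MB\n")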
There is also the interesting cgmemtime project, but it requires a bit more setup.
If you are on Windows, I suggest moving to Linux.