Memory profiling with data.table

What is the correct way to profile memory in R code that contains calls to data.table functions? Let's say I want to determine the maximum memory usage during an expression.

This reference indicates that Rprofmem may not be the right choice: https://cran.r-project.org/web/packages/profmem/vignettes/profmem.html

All memory allocations that are done via the native allocVector3() part of R's native API are logged, which means that nearly all memory allocations are logged. Any objects allocated this way are automatically deallocated by R's garbage collector at some point. Garbage collection events are not logged by profmem(). Allocations not logged are those done by non-R native libraries or R packages that use native code Calloc() / Free() for internal objects. Such objects are not handled by the R garbage collector.

The data.table source code contains plenty of calls to Calloc() and malloc(), which suggests that Rprofmem will not measure all memory allocated by data.table functions. If Rprofmem is not the right tool, why does Matthew Dowle use it in "R: loop over columns in data.table"?

I've found a reference suggesting similar potential issues for gc() (which can be used to measure maximum memory usage between two calls to gc()): https://r.789695.n4.nabble.com/Determining-the-maximum-memory-usage-of-a-function-td4669977.html

gc() is a good start. Call gc(reset = TRUE) before and gc() after your task, and you will see the maximum extra memory used by R in the interim. (This does not include memory malloced by compiled code, which is much harder to measure as it gets re-used.)
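The gc() approach quoted above can be sketched as follows. This is a minimal illustration, not a definitive tool: peak_extra_mb is a hypothetical helper name, and, as the quote says, the figure covers only memory managed by R, not memory malloced by compiled code.

```r
# Peak extra memory (in Mb) used by R's own heap while evaluating `expr`.
# peak_extra_mb is a hypothetical helper, not part of any package.
# Memory malloc'ed by compiled code is NOT included in this figure.
peak_extra_mb <- function(expr) {
  # Sum the "(Mb)" column that directly follows the named column of gc() output.
  mb_after <- function(m, col) {
    i <- which(colnames(m) == col)[1]
    sum(m[, i + 1])
  }
  baseline <- gc(reset = TRUE)  # reset the "max used" statistics
  force(expr)                   # the expression is evaluated here (lazy argument)
  after <- gc()                 # "max used" now holds the peak since the reset
  mb_after(after, "max used") - mb_after(baseline, "used")
}

# Example: allocate a vector of 1e7 doubles and report the peak in Mb.
peak_extra_mb(numeric(1e7))
```

Looking up the "max used" column by name rather than by position is a deliberate choice here, since gc() can return an extra "limit (Mb)" column when memory limits are set.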

Nothing I've found suggests that similar issues exist with Rprof(memory.profiling=TRUE). Does this mean that the Rprof approach will work for data.table even though it doesn't always use the R API to allocate memory?

If Rprof(memory.profiling=TRUE) in fact is not the right tool for the job, what is?

Would ssh.utils::mem.usage work?

Michael asked Oct 08 '19 00:10


1 Answer

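This is not specific to data.table. There was recently a discussion on Twitter about the same behaviour in dplyr: https://mobile.twitter.com/healthandstats/status/1182840075001819136

You can measure the peak resident memory of the whole R process from outside R, which also counts memory allocated by compiled code: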

/usr/bin/time -v Rscript -e 'library(data.table); CJ(1:1e4, 1:1e4)' |& grep resident
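If you want the same number from within an R session, the command above can be wrapped in R. The following is only a sketch under stated assumptions: GNU time is installed at /usr/bin/time, Rscript is on the PATH, and parse_max_rss_kb and peak_rss_kb are hypothetical helper names.

```r
# Extract the "Maximum resident set size (kbytes)" value from the
# verbose output of GNU time (/usr/bin/time -v).
# Both function names below are hypothetical helpers.
parse_max_rss_kb <- function(lines) {
  hit <- grep("Maximum resident set size", lines, value = TRUE, fixed = TRUE)
  if (length(hit) == 0L) return(NA_real_)
  as.numeric(sub("^.*:\\s*", "", hit[1]))
}

# Run `code` in a fresh Rscript process under GNU time and return the
# peak resident set size of that process in kilobytes.
# Assumes /usr/bin/time is GNU time and Rscript is on the PATH.
peak_rss_kb <- function(code) {
  out <- suppressWarnings(system2(
    "/usr/bin/time",
    c("-v", "Rscript", "-e", shQuote(code)),
    stdout = TRUE, stderr = TRUE   # GNU time writes its report to stderr
  ))
  parse_max_rss_kb(out)
}
```

For example, peak_rss_kb("library(data.table); invisible(CJ(1:1e4, 1:1e4))") would measure the same expression as the shell command. The figure includes the baseline cost of starting R and loading packages, so compare it against a run of an empty script.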

There is also the interesting cgmemtime project, but it requires a bit more setup.

If you are on Windows, I suggest moving to Linux.

jangorecki answered Nov 15 '22 08:11