 

Why does peak memory usage increase when there are more elements to loop/apply over?

I am trying to reduce the memory footprint of an R package and have noticed behaviour that I can't seem to suppress. See the below example:

x <- matrix(runif(1.5e7), ncol = 200)

## CASE 1: Test with half of columns
gc(reset = TRUE)
a <- apply(x[, 1:100], 2, quantile)
gc()
#            used  (Mb) gc trigger  (Mb) max used  (Mb)
# Ncells   190549  10.2     407500  21.8   222055  11.9
# Vcells 15292303 116.7   35490421 270.8 35484249 270.8
object.size(a)
# 4696 bytes
rm(a)

## CASE 2: Test with all columns
gc(reset = TRUE)
b <- apply(x, 2, quantile)
gc()
#            used  (Mb) gc trigger  (Mb) max used  (Mb)
# Ncells   190824  10.2     407500  21.8   245786  13.2
# Vcells 15293740 116.7   39292189 299.8 39286529 299.8
object.size(b)
# 8696 bytes
rm(b)

## CASE 3: Test with all columns + call gc
gc(reset = TRUE)
c <- apply(x, 2, function(i) { r <- quantile(i); gc(); r })
gc()
#           used  (Mb) gc trigger  (Mb) max used  (Mb)
# Ncells   191396  10.3     407500  21.8   197511  10.6
# Vcells 15294307 116.7   45737818 349.0 30877185 235.6
object.size(c)
# 8696 bytes
rm(c)

a and b differ in size by only ~4 KB, yet the garbage collector reports a ~30 MB difference in peak memory usage between cases 1 and 2. c uses less peak memory than both a and b, though I imagine not without a considerable penalty in runtime.

The peak memory allocation seems to positively correlate with the number of columns considered in the call to apply, but why? Does the call to apply result in memory allocation living beyond the scope of an iteration? I would have expected any internal temporaries to be freed (or marked as being unused) by the gc before the end of each iteration.

This behaviour can be reproduced using lapply over data.frames and also with different functions in lieu of quantile.
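For reference, a minimal reproduction sketch of the lapply/data.frame variant (same scale as the matrix above; the exact `gc()` numbers will vary by machine):

```r
## Reproduction sketch: peak usage ("max used" in gc()) again grows with the
## number of columns processed, mirroring the apply() cases above.
df <- as.data.frame(matrix(runif(1.5e7), ncol = 200))
gc(reset = TRUE)
q <- lapply(df, quantile)   # one quantile vector per column
gc()
```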

I am under the impression that I am overlooking a very fundamental aspect of memory usage behaviour in R but still can't wrap my head around it. Ultimately, my question is: how do I further reduce the memory footprint in cases like the example above?
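One mitigation I have considered (a sketch of my own, not code from above; `col_quantiles` is a hypothetical helper name): `x[, 1:100]` makes a full copy before `apply` even starts, and `apply` then splits the matrix into per-column vectors, so temporaries scale with the number of columns. Looping over columns and writing into a preallocated result keeps only one column live at a time:

```r
## Sketch: compute per-column quantiles without the large subset copy or
## apply()'s internal splitting. Only one column is extracted per iteration.
col_quantiles <- function(x, cols = seq_len(ncol(x))) {
  probs_names <- names(quantile(x[, cols[1]]))  # "0%", "25%", ...
  out <- matrix(NA_real_, nrow = length(probs_names), ncol = length(cols),
                dimnames = list(probs_names, colnames(x)[cols]))
  for (j in seq_along(cols)) {
    out[, j] <- quantile(x[, cols[j]])  # one column in memory at a time
  }
  out
}
```

With the question's `x`, `col_quantiles(x)` should match `apply(x, 2, quantile)` in result while keeping the per-iteration temporaries small.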

Thanks in advance and do not hesitate to point out any inaccuracies in my question.

EDIT:

As per @ChristopherLouden's suggestion, I used calls to mem() in place of gc(), and all three cases were reported as using ~126.92 MB.

##  http://adv-r.had.co.nz/memory.html#garbarge-collection
mem <- function() {
  bit <- 8L * .Machine$sizeof.pointer  # 32- or 64-bit build
  if (!(bit == 32L || bit == 64L)) {
    stop("Unknown architecture", call. = FALSE)
  }

  node_size <- if (bit == 32L) 28L else 56L  # bytes per Ncell

  usage <- gc()
  # Ncells * node_size + Vcells * 8 bytes, reported in MB
  sum(usage[, 1] * c(node_size, 8)) / (1024 ^ 2)
}
Asked Feb 04 '14 by Nicolas De Jay


1 Answer

I think this sentence from the Memory Chapter of Advanced R Programming by Hadley Wickham best summarizes the reason for the discrepancy.

Garbage collection normally happens lazily: R calls gc() when it needs more space. This means that R might hold onto the memory after the function has terminated, but it will release it as soon as it's needed.

The chapter also provides a function called mem() that shows more clearly than gc() how much memory a block of code is using. If time allows, I would redo the test with Wickham's mem() function.

Edit: As Peter noted, the mem() function is deprecated. Use the mem_used() function from the pryr package instead.
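A sketch of that replacement, assuming the pryr package is installed (mem_used() and mem_change() are pryr functions; treat the printed numbers as illustrative):

```r
## Requires: install.packages("pryr")
library(pryr)

mem_used()                                          # total memory R is using now
mem_change(x <- matrix(runif(1.5e7), ncol = 200))   # net change ~ size of x (1.5e7 doubles)
mem_change(b <- apply(x, 2, quantile))              # net change from the apply() call
```

mem_change() runs the expression and reports the net difference in mem_used(), which sidesteps reading gc()'s "max used" column by eye.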

Answered Oct 02 '22 by Christopher Louden