
R: clarification on memory management

Tags: memory, r, bigdata

Suppose I have a matrix bigm. I need to pass a random subset of its rows to a machine learning algorithm such as svm. The random subset is only known at runtime. Additionally, there are other parameters that are chosen from a grid.

So, I have code that looks something like this:

foo = function (bigm, inTrain, moreParamsList) {
  parsList = c(list(data=bigm[inTrain, ]), moreParamsList)
  do.call(svm, parsList)
}

What I want to know is whether R allocates new memory to store the bigm[inTrain, ] object in parsList. (My guess is that it does.) What commands can I use to test such hypotheses myself? Additionally, is there a way to use a sub-matrix in R without allocating new memory?

Edit:

Also, assume I am calling foo using mclapply (on Linux), where bigm resides in the parent process. Does that mean I am making mc.cores copies of bigm, or do all cores just use the object from the parent?

Are there any functions or heuristics for tracking the memory location and consumption of objects created in different cores?

Thanks.

asked Oct 21 '22 22:10 by asb


2 Answers

Here is what I found while researching this topic:

I don't think mclapply makes mc.cores copies of bigm, based on this passage from the manual for the multicore package:

In a nutshell fork spawns a copy (child) of the current process, that can work in parallel
to the master (parent) process. At the point of forking both processes share exactly the
same state including the workspace, global options, loaded packages etc. Forking is
relatively cheap in modern operating systems and no real copy of the used memory is
created, instead both processes share the same memory and only modified parts are copied.
This makes fork an ideal tool for parallel processing since there is no need to setup the
parallel working environment, data and code is shared automatically from the start.
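
As a rough illustration of the passage above, here is a minimal sketch that assumes a fork-capable OS (Linux or macOS) and the base parallel package. The matrix size, the subsets list, and the per-subset computation are made up for the example. Each child reads bigm from the memory it shares with the parent; the only new allocation inside a child should be the sub-matrix it extracts.

```r
library(parallel)

set.seed(42)
bigm <- matrix(rnorm(1e6), nrow = 1000)   # hypothetical data living in the parent

# Hypothetical list of row subsets, one per task.
subsets <- replicate(4, sample(nrow(bigm), 100), simplify = FALSE)

res <- mclapply(subsets, function(inTrain) {
  # bigm is not passed as an argument; the forked child sees the parent's copy.
  # bigm[inTrain, ] is a new, much smaller allocation local to the child.
  sum(bigm[inTrain, ])
}, mc.cores = 2)
```

Because nothing in the worker function writes to bigm, copy-on-write should not need to duplicate its data pages. This is a sketch of the idea, not a measurement; actual memory use is best checked with OS tools such as top or ps on the child processes.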
answered Oct 23 '22 23:10 by asb


For the first part of your question, you can use tracemem:

This function marks an object so that a message is printed whenever the internal code copies the object.

Here is an example:

a <- 1:10
tracemem(a)
## [1] "<0x000000001669cf00>"
b <- a        ## b and a share memory (no message)
d <- stats::rnorm(10)
invisible(lm(d ~ a+log(b)))
## tracemem[0x000000001669cf00 -> 0x000000001669e298]   ## object a is copied twice
## tracemem[0x000000001669cf00 -> 0x0000000016698a38]
untracemem(a)
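
The same idea can be applied to the matrix case from the question. The following is a minimal sketch with made-up sizes and a hypothetical inTrain index. It uses object.size to show that bigm[inTrain, ] is a separate object with its own storage, and tracemem to show that a plain assignment is only copied once it is modified. tracemem needs an R build with memory profiling enabled, so the sketch checks capabilities("profmem") first.

```r
set.seed(1)
bigm <- matrix(rnorm(1e6), nrow = 1000)   # 1000 x 1000 doubles, roughly 8 MB
inTrain <- sample(nrow(bigm), 500)        # hypothetical training rows

sub <- bigm[inTrain, ]                    # subsetting builds a new matrix

print(object.size(bigm), units = "MB")
print(object.size(sub), units = "MB")     # about half the size of bigm

# Plain assignment shares memory; the copy happens on first modification.
alias <- bigm
traced <- capabilities("profmem")         # TRUE if tracemem is available
if (traced) tracemem(bigm)
alias[1, 1] <- 0                          # a tracemem message is expected here
if (traced) untracemem(bigm)
```

Note that tracemem reports duplications of the marked object. The subsetting step allocates a fresh object rather than duplicating bigm, so comparing object.size values (or watching gc() output before and after) is the more direct check for that step.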
answered Oct 24 '22 00:10 by agstudy