
How is memory managed while overwriting R objects?

I'm handling some large datasets and am doing what I can to stay under R's memory limits. One question came up regarding the overwriting of R objects. I have a large data.table (or any R object), and it has to be copied to tmp multiple times. The question is: does it make any difference if I delete tmp before overwriting it? In code:

for (i in 1:lots_of_times) {
    v_l_d_t_tmp <- copy(very_large_data_table) # Necessary copy of a 7GB data
                                               # table on a 16GB machine. I can
                                               # afford 2 but not 3 copies.
    ### do stuff to v_l_d_t_tmp and output
    rm(v_l_d_t_tmp)  # The question is whether this rm keeps max memory
                     # usage lower, or if it is equivalent to what an
                     # overwrite will automatically do on the next iteration.
}

Assume the copy is necessary (if I reach a point where I need to read very_large_data_table from disk in each iteration, I'll do that, but the question stands: will it make any difference to max memory usage if I explicitly delete v_l_d_t_tmp before loading into it again?).

Or, to teach the man to fish, what could I have typed (within R, let's not get into ps) to answer this myself?

It's totally OK if the answer turns out to be: "Trust garbage collection."
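For reference, base R can report a high-water mark: gc(reset = TRUE) zeroes the "max used" columns of gc()'s output, so you can bracket a block of code and read the peak afterwards. A minimal sketch (the runif() vector is just a placeholder for the real work):

    invisible(gc(reset = TRUE))   # reset the "max used" counters
    x <- runif(1e7)               # placeholder for the real work
    rm(x)
    gc()                          # "max used" shows the peak since the reset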

asked Apr 30 '15 by enfascination



2 Answers

Here's another idea... it doesn't directly answer your question; instead, it tries to get around the problem by eliminating it in another way. It might get you thinking:

What if you instead cache very_large_data_table, read it in just once per R session, do what you need to do, and then exit R? Now write a loop outside of R, and the memory problem vanishes. Granted, this costs you more CPU because you have to read in the 7GB multiple times... but it may be worth it to save the memory. In fact, this halves your memory use, since you never have to copy the table.
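For reference, a minimal base-R sketch of this idea (simpleCache, below, just packages it up); build_table() and the "vldt.rds" path are hypothetical placeholders for however you construct and store the table:

    if (file.exists("vldt.rds")) {
        very_large_data_table <- readRDS("vldt.rds")
    } else {
        very_large_data_table <- build_table()      # hypothetical constructor
        saveRDS(very_large_data_table, "vldt.rds")  # cache to disk for next run
    }
    ### do stuff to very_large_data_table and output, then let R exit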

In addition, as @konvas pointed out in the comments, I too found that rm(), even with gc(), never got me what I needed with a long loop; memory would just accumulate and eventually bog everything down. Exiting R is the easy way out.
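For concreteness, here is a sketch of the pattern I mean (not a recommendation):

    for (i in 1:lots_of_times) {
        v_l_d_t_tmp <- copy(very_large_data_table)
        ### do stuff to v_l_d_t_tmp and output
        rm(v_l_d_t_tmp)
        gc()   # explicit collection; in my experience memory still crept up
    }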

I had to take this exit-and-restart approach so often that I wrote a package to help me cache objects like this: simpleCache.

If you're interested in trying it, it would look something like this.

Do this outside of R, in a shell:

for i in $(seq 1 "$lots_of_times"); do
    Rscript my_script.R
done

Then in R, put something like this in my_script.R:

library(simpleCache)
simpleCache("very_large_data_table", {
    # ... R code for how you make this table ...
}, assignTo = "v_l_d_t_tmp")

### do stuff to v_l_d_t_tmp and output
answered Oct 12 '22 by nsheff

This is a comment more than an answer, but it is becoming too long.

I guess that in this case a call to rm might be proper. I think that, starting from the second iteration, you may have 3 tables in memory if you don't call rm. While copying the large object, R cannot free the memory occupied by v_l_d_t_tmp before the copy ends, since the function call may fail, and in that case the old object must be preserved. Consider this example:

 x <- 1:10
 myfunc <- function(y) { Sys.sleep(3); 30 }

Here I defined an object and a function that takes some time to do something. If you try:

 x <- myfunc()

and interrupt the execution before it ends "naturally", the object x still exists, with its 1:10 content. So, I guess that in your case, even if you use the same symbol, R cannot free the old content before or during the copy. It can if you remove the object before the following copy. Of course, the old object will be removed after the copy completes, but you may run out of memory during it.
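A sketch of how one could check this empirically with a smaller stand-in table (the sizes here are illustrative, not from the question):

    library(data.table)
    dt <- data.table(x = runif(5e7))   # stand-in table, roughly 400 MB

    # Case 1: overwrite in place -- the old tmp and the new copy coexist briefly.
    tmp <- copy(dt)
    invisible(gc(reset = TRUE))        # reset the "max used" high-water mark
    tmp <- copy(dt)
    gc()                               # note the "max used" column

    # Case 2: rm() first -- only one extra copy is alive while copy() runs.
    rm(tmp)
    invisible(gc(reset = TRUE))
    tmp <- copy(dt)
    gc()                               # "max used" should be noticeably lower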

I'm by no means an expert on R internals, so don't take what I just said for granted.

answered Oct 12 '22 by nicola