Out of memory when modifying a big R data.frame

Tags: dataframe, r

I have a big data frame taking about 900MB of RAM. I tried to modify it like this:

dataframe[[17]][37544] = 0

That makes R use more than 3GB of RAM, and R complains with "Error: cannot allocate vector of size 3.0 Mb". (I am on a 32-bit machine.)

I found that this way works better:

dataframe[37544, 17] = 0

but R's memory footprint still doubles and the command takes quite some time to run.

Coming from a C/C++ background, I am really confused by this behavior. I thought something like dataframe[37544, 17] = 0 should complete in a blink without costing any extra memory (only one cell needs to be modified). What is R doing for those commands? And what is the right way to modify some elements of a data frame without doubling the memory footprint?
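
For what it's worth, the copying can be observed with tracemem() on a small toy data frame (a minimal sketch; tracemem() requires an R build with memory profiling, which the standard CRAN binaries have):

df <- data.frame(a = runif(1e6), b = runif(1e6))
tracemem(df)      # report whenever df is duplicated
df[1, 2] <- 0     # prints a tracemem line: the whole data frame is copied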

Thanks so much for your help!

Tao

asked Feb 29 '12 by agmao



1 Answer

Following up on Joran's suggestion of data.table, here are some links. Your object, at 900MB, is manageable in RAM even in 32-bit R, with no copies at all.

When should I use the := operator in data.table?

Why has data.table defined := rather than overloading <-?

Also, data.table v1.8.0 (not yet on CRAN but stable on R-Forge) has a set() function which provides even faster assignment to elements, as fast as assignment to a matrix (appropriate for use inside loops, for example). See the latest NEWS for more details and examples. Also see ?":=", which is linked from ?data.table.

And here are 12 questions on Stack Overflow with the data.table tag containing the word "reference".

For completeness:

require(data.table)
DT = as.data.table(dataframe)
# say column name 17 is 'Q' (i.e., LETTERS[17])
# then any of the following :

DT[37544, Q := 0]                  # using column name (often preferred)

DT[37544, 17 := 0, with = FALSE]   # using column number

col = "Q"
DT[37544, col := 0, with = FALSE]  # variable holding the name

col = 17
DT[37544, col := 0, with = FALSE]  # variable holding the number

set(DT, 37544L, 17L, 0)            # using set(i, j, value) in v1.8.0
set(DT, 37544L, "Q", 0)            # same, by column name

But please do see the linked questions and the package's documentation to see how := is more general than this simple example; e.g., combining := with binary search in an i join, as sketched below.
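
For instance, a minimal sketch (assuming a hypothetical key column id in DT):

setkey(DT, id)          # sort by id and mark it as the key
DT[J("id42"), Q := 0]   # binary search for id == "id42", then assign by reference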

answered by Matt Dowle