
R stats - memory issues when allocating a big matrix / Linux

I have read several threads about memory issues in R and I can't seem to find a solution to my problem.

I am running a sort of LASSO regression on several subsets of a big dataset. For some subsets it works well, and for some bigger subsets it does not, with errors like "cannot allocate vector of size 1.6Gb". The error occurs at this line of code:

example <- cv.glmnet(x=bigmatrix, y=price, nfolds=3)

Whether it fails also depends on the number of variables included in "bigmatrix".

I tried R and R64 on Mac, and R on PC, but recently moved to a faster virtual machine on Linux, thinking I would avoid any memory issues. It was better, but there are still limits, even though memory.limit() reports "Inf".

Is there any way to make this work, or do I have to cut a few variables from the matrix or take a smaller subset of data?

I have read that R looks for contiguous blocks of memory and that maybe I should pre-allocate the matrix? Any ideas?

Asked Jan 16 '11 by Emmanuel


2 Answers

Let me build slightly on what @richardh said. All of the data you load into R chews up RAM. So you load your main data and it uses some hunk of RAM. Then you subset the data, so the subset uses a smaller hunk. Then the regression algorithm needs a hunk that is bigger than your subset because it does some manipulations and gyrations. Sometimes I am able to make better use of RAM by doing the following (a rough sketch in R follows the list):

  1. save the initial dataset to disk using save()
  2. take a subset of the data
  3. rm() the initial dataset so it is no longer in memory
  4. do analysis on the subset
  5. save results from the analysis
  6. totally dump all items in memory: rm(list=ls())
  7. load the initial dataset from step 1 back into RAM using load()
  8. loop steps 2-7 as needed

Be careful with step 6 and try not to shoot your eye out. That dumps EVERYTHING in R's memory. If it hasn't been saved, it'll be gone. A more subtle approach is to delete only the big objects that you are sure you don't need, instead of doing rm(list=ls()).
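A minimal sketch of that workflow might look like the code below. The object and file names (full_data, my_subset, "full_data.RData", and so on) are hypothetical placeholders, and the cv.glmnet() call just mirrors the one in the question; adapt both to your own data.

library(glmnet)

# 1. save the initial dataset to disk
save(full_data, file = "full_data.RData")

# 2. take the subset you need for this analysis (hypothetical filter)
my_subset <- full_data[full_data$group == "A", ]

# 3. drop the full dataset from memory and ask R to release the RAM
rm(full_data)
gc()

# 4.-5. run the analysis on the subset and save the results
fit <- cv.glmnet(x = as.matrix(my_subset[, -1]), y = my_subset$price, nfolds = 3)
save(fit, file = "fit_groupA.RData")

# 6. dump everything from the workspace (careful: unsaved objects are lost)
rm(list = ls())
gc()

# 7. reload the full dataset and loop back to step 2 for the next subset
load("full_data.RData")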

If you still need more RAM, you might want to run your analysis in Amazon's cloud. Their High-Memory Quadruple Extra Large Instance has over 68GB of RAM. Sometimes when I run into memory constraints I find the easiest thing to do is just go to the cloud where I can be as sloppy with RAM as I want to be.

Jeremy Anglim has a good blog post that includes a few tips on memory management in R. In that post Jeremy links to a previous StackOverflow question which I found helpful.

Answered by JD Long


I don't think this has to do with contiguous memory, but just that R by default keeps everything in RAM (i.e., it can't transparently spill objects to disk). Farnsworth's guide to econometrics in R mentions the filehash package for writing objects to disk, but I don't have any experience with it.
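A rough sketch of how filehash might be used, based on its dbCreate()/dbInit()/dbInsert()/dbFetch() interface (the database name "bigdata_db" and the object names are hypothetical; note the matrix still has to fit in RAM while the model is actually being fit):

library(filehash)
library(glmnet)

# create and initialise an on-disk database
dbCreate("bigdata_db")
db <- dbInit("bigdata_db")

# store the large matrix on disk instead of keeping it in the workspace
dbInsert(db, "bigmatrix", bigmatrix)
rm(bigmatrix)
gc()

# pull it back only when it is needed for the regression
x <- dbFetch(db, "bigmatrix")
fit <- cv.glmnet(x = x, y = price, nfolds = 3)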

Your best bet may be to work with smaller subsets, manage memory manually by removing variables you don't need with rm() (i.e., run the regression, store the results, remove the old matrix, load the new matrix, repeat), and/or get more RAM. HTH.
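That manual bookkeeping could look something like the loop below; the file names and the structure of each saved object (a list with an x matrix and a y vector) are hypothetical, the point is just to hold one matrix in memory at a time.

library(glmnet)

subset_files <- c("subset1.rds", "subset2.rds", "subset3.rds")  # hypothetical files, one per subset
results <- list()

for (f in subset_files) {
  dat <- readRDS(f)                                    # load one subset's x matrix and y vector
  results[[f]] <- cv.glmnet(x = dat$x, y = dat$y, nfolds = 3)
  rm(dat)                                              # drop the old matrix before loading the next
  gc()                                                 # ask R to return the freed memory
}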

Answered by Richard Herron