
Unable to allocate vector in R with plenty of memory available

I am trying to load a large CSV file (226M rows by 38 columns) into 64-bit R using the data.table package. The file is about 27 GB on disk, and I am doing this on a server with 64 GB of RAM. I shut down almost everything else and started a fresh R/RStudio session, so only 2 GB of memory are in use when I start the fread. As the read progresses, I watch memory usage climb to about 45.6 GB, and then I get the dreaded Error: cannot allocate vector of size 1.7 Gb. However, over 18 GB remains available. Is it possible that in 18 GB of RAM there isn't a single contiguous block of 1.7 GB? Does it have to do with the committed size (which I admit to not fully understanding), and if so, is there any way to minimize the committed size so that enough space remains?

The data comprise the history of a cohort of users for which I want to aggregate and summarize certain statistics over time. I've been able to import a subset of the 38 columns using select in fread, so I'm not at a complete loss, but it does mean that should I need to work with other variables, I'll have to pick and choose, and may eventually run into the same error.
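For reference, here is a minimal sketch of that select approach; the file name and column names below are placeholders, not the actual data:

library(data.table)

# Read only the columns needed for the current analysis instead of all 38;
# fread skips the unlisted columns entirely, which keeps memory usage down.
dt <- fread("cohort_history.csv",
            select = c("user_id", "event_date", "metric"))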

For the setup I have, are there other ways to get this entire dataset into memory, or will I need to either keep importing subsets or move to a more big-data-friendly platform?

Thank you.

Memory Usage Prior to Read

[screenshot: memory usage prior to read]

Memory Usage at Failure

[screenshot: memory usage at failure]

Session Info

R version 3.3.0 Patched (2016-05-11 r70599)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server >= 2012 x64 (build 9200)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.9.6

loaded via a namespace (and not attached):
[1] tools_3.3.0  chron_2.3-47
Asked Jul 06 '16 by Avraham


1 Answer

You're running out of memory because some kinds of data take up less space as plain text than they do in memory (the opposite can also be true). The classic example is single-digit integers (0-9), which occupy only a single byte in a text file but 4 bytes of memory in R (and conversely, numbers longer than four digits occupy less memory as integers than the corresponding text characters would).
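A rough illustration of that size difference, not from the original answer, using base R's object.size:

# One million single-digit values: about 2 MB as text (one digit plus one
# separator per value), but 4 bytes each once stored as an R integer vector
# and 8 bytes each as doubles.
x <- sample(0:9, 1e6, replace = TRUE)
format(object.size(x), units = "MB")              # ~3.8 MB (integer)
format(object.size(as.numeric(x)), units = "MB")  # ~7.6 MB (double)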

One workaround for this is to read those columns as character instead, which keeps the memory footprint roughly the same as the text, and only convert them to integers when you actually do numeric operations on them (sketched below). The trade-off will naturally be speed.
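A minimal sketch of that workaround using fread's colClasses argument, assuming a reasonably recent data.table and placeholder file/column names:

library(data.table)

# Read the narrow code-like columns as character, as suggested above,
# rather than letting fread pick a numeric type for them.
dt <- fread("cohort_history.csv",
            colClasses = list(character = c("code1", "code2")))

# Convert only at the point of calculation; slower, but the conversion is
# transient rather than stored for the whole table.
dt[, mean(as.numeric(code1))]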

Answered Oct 18 '22 by eddi