I am trying to load a large CSV file (226M rows by 38 columns) into 64-bit R using the data.table package. The file is about 27 GB on disk, and I am doing this on a server with 64 GB of RAM. I shut most everything else down and started a fresh R/RStudio session, so when I start the fread only about 2 GB of memory are in use. As the read proceeds, I watch memory usage climb to about 45.6 GB, and then I get the dreaded Error: cannot allocate vector of size 1.7 Gb. However, over 18 GB remain available. Is it possible that within those 18 GB of RAM there isn't a single contiguous block of 1.7 GB? Does it have to do with the committed size (which I admit to not fully understanding), and if so, is there any way to minimize the committed size so that enough space remains for the read to finish?
The data comprise the history of a cohort of users for whom I want to aggregate and summarize certain statistics over time. I've been able to import a subset of the 38 columns using the select argument of fread, so I'm not at a complete loss, but it does mean that if I need to work with other variables I'll have to pick and choose, and I may eventually run into the same error.
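For reference, a minimal sketch of the kind of column-subset read that does succeed for me; the file name and column names below are placeholders, not the real ones:

    library(data.table)

    # Read only a handful of the 38 columns (names are stand-ins for the real ones)
    dt <- fread("cohort_history.csv",
                select = c("user_id", "event_date", "metric_1"))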
Given the setup I have, are there other ways to get this entire dataset into memory, or will I need to either keep importing subsets or move to a big-data-friendly platform?
Thank you.
R version 3.3.0 Patched (2016-05-11 r70599)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server >= 2012 x64 (build 9200)
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.9.6
loaded via a namespace (and not attached):
[1] tools_3.3.0 chron_2.3-47
You're running out of memory because some types of data take less space as plain text on disk than they do in R's memory (the opposite can also be true). The classic example is single-digit integers (0-9): each occupies a single byte in a text file but 4 bytes as an integer in R (conversely, numbers with more than four digits take less memory as integers than their text representation would).
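As a rough illustration (the sizes are approximate and assume a plain column of single-digit integers):

    x <- sample(0:9, 1e6, replace = TRUE)

    object.size(x)             # about 4 MB: 4 bytes per integer in R's memory
    sum(nchar(x)) + length(x)  # about 2 MB as text: one digit plus one separator per value

So a column like this roughly doubles in size when loaded.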
One workaround is to read those columns as character instead, which keeps their memory footprint the same as in the text file, and only convert them to integers when you actually need to do numeric operations on them. The trade-off will naturally be speed.
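A sketch of how that might look using fread's colClasses argument; the column names here are hypothetical:

    library(data.table)

    # Force the narrow integer-like columns to be read as character so they keep
    # their text-sized representation (column names are placeholders)
    dt <- fread("cohort_history.csv",
                colClasses = list(character = c("flag_a", "flag_b")))

    # Convert on the fly only when a numeric operation needs them
    dt[, mean(as.integer(flag_a))]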