
Reading 40 GB csv file into R using bigmemory

The title is pretty self-explanatory, but I will elaborate. Some of my current techniques for attacking this problem are based on the solutions presented in this question. However, I am facing several challenges and constraints, so I was wondering if someone might take a stab at it. I am trying to solve the problem with the bigmemory package, but I have been running into difficulties.

Present Constraints:

  • Linux server with 16 GB of RAM
  • CSV file size: 40 GB
  • Number of rows: 67,194,126,114

Challenges

  • Need to be able to randomly sample smaller datasets (5-10 million rows) from a big.matrix or equivalent data structure.
  • Need to be able to drop any row that contains at least one NULL while parsing into a big.matrix or equivalent data structure.

So far, the results are not good. Evidently I am doing something wrong, or maybe I just don't understand the bigmemory documentation well enough. So, I thought I would ask here to see if anyone has used

Any tips or advice on this line of attack? Or should I change to something else? I apologize if this question is very similar to the previous ones, but my data is about 20 times bigger than in the previous questions. Thanks!

Shion asked Mar 20 '13 19:03


1 Answer

I don't know about bigmemory, but to satisfy your challenges you don't need to read the whole file into R. Simply pipe it through some bash/awk/sed/python/whatever processing that does the steps you want, i.e. throws out NULL lines and randomly selects N lines, and then read the result in.

Here's an example using awk (assuming you want 100 random lines from a file that has 1M lines).

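# Selection sampling: keep m = 100 of n = 1,000,000 lines, skipping lines that contain "NULL".
# The line count is stored in n because `length` is a built-in awk function name.
df <- read.csv(pipe("awk -F, 'BEGIN { srand(); m = 100; n = 1000000 }
                       !/NULL/ { if (rand() < m / (n - NR + 1)) {
                                   print; m--
                                   if (m == 0) exit
                                 } }' filename"),
               header = FALSE)  # sampled lines are data rows, not a header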

It wasn't obvious to me what you meant by NULL, so I took it literally (any line containing the string NULL), but it should be easy to modify the filter to fit your needs.
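
As one possible modification, here is a minimal sketch that assumes NULL means a field that is either empty or exactly the string NULL, and that the first line of the file is a header. The function name sample_rows is made up for illustration. It reads the file twice: the first pass counts the rows that survive the filter, so the second pass can sample exactly M of them instead of relying on a total line count that still includes the discarded rows. Fields are split on plain commas, so quoted fields containing commas are not handled.

```shell
#!/bin/sh
# sample_rows FILE M  (hypothetical helper, not part of any package)
# Prints M random data rows of FILE, skipping the header line and any row
# in which some field is empty or exactly "NULL".
sample_rows() {
    file=$1
    m=$2

    # Pass 1: count the data rows that survive the filter.
    n=$(awk -F, '
        NR > 1 { for (i = 1; i <= NF; i++) if ($i == "" || $i == "NULL") next; c++ }
        END    { print c + 0 }' "$file")

    # Pass 2: selection sampling over the surviving rows only.
    # Each surviving row is kept with probability
    # (rows still wanted) / (surviving rows not yet seen).
    awk -F, -v m="$m" -v n="$n" '
        BEGIN   { srand() }
        NR == 1 { next }
        { for (i = 1; i <= NF; i++) if ($i == "" || $i == "NULL") next }
        { seen++
          if (rand() * (n - seen + 1) < m) { print; if (--m == 0) exit }
        }' "$file"
}
```

If M is larger than the number of surviving rows, all of them are printed. The output has no header line, so it can be read into R through pipe() in the same way as above with header = FALSE.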

eddi answered Oct 28 '22 03:10