The title is fairly self-explanatory, but I will elaborate. Some of my current techniques for attacking this problem are based on the solutions presented in this question. However, I am facing several challenges and constraints, so I was wondering if someone might take a stab at this problem. I am trying to tackle it with the bigmemory package, but I have been running into difficulties.
Present Constraints:
Challenges
So far, results are not good. Evidently I am failing at something, or maybe I just don't understand the bigmemory documentation well enough. So I thought I would ask here to see if anyone has used bigmemory for this kind of task.
Any tips or advice on this line of attack? Or should I switch to something else? I apologize if this question is very similar to the previous one, but I thought the scale of my data was about 20 times bigger than in the previous questions. Thanks!
I don't know about bigmemory, but to satisfy your challenges you don't need to read the file in at all. Simply pipe some bash/awk/sed/python/whatever processing to do the steps you want, i.e. throw out the NULL lines and randomly select N lines, and then read that result in.
Here's an example using awk (assuming you want 100 random lines from a file that has 1M lines).
# read ~100 randomly chosen non-NULL lines out of a 1M-line CSV
df <- read.csv(pipe('awk -F, \'BEGIN{srand(); m = 100; total = 1000000;}
  !/NULL/{if (rand() < m/(total - NR + 1)) {
    print; m--;
    if (m == 0) exit;
  }}\' filename'))
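A note on the design: the selection probability m/(total - NR + 1) is the classic sequential-sampling rule (pick each line with probability lines-still-needed over lines-remaining), so the BEGIN block assumes you know the total line count up front; for a file of a different size, update total accordingly (e.g. from wc -l). Also, since NULL lines still advance NR but are never eligible, the command may return slightly fewer than m lines if the file contains many of them.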
It wasn't obvious to me what you meant by NULL, so I took it literally (any line containing the string NULL is dropped), but it should be easy to modify to fit your needs.
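For instance, if NULL really means a missing value in one particular column, say the third, a minimal variation could test that field directly instead of matching the whole line; the column number and the "NULL"/empty-string test below are assumptions you would adapt to your data.

# hypothetical variant: drop rows whose third field is "NULL" or empty
df <- read.csv(pipe('awk -F, \'BEGIN{srand(); m = 100; total = 1000000;}
  $3 != "NULL" && $3 != "" {if (rand() < m/(total - NR + 1)) {
    print; m--;
    if (m == 0) exit;
  }}\' filename'))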