I have a large csv file (20 GB, almost 200 million lines) that I cannot load into memory as a whole, so I want to load it piece by piece.
I didn't find a way to use a file connection in fread (like the one in readLines), so I tried to use skip:
for (i in 1:100) {
  # read the next chunk of rowPerRead rows, skipping everything already read
  lines = fread(filename, nrows = rowPerRead, skip = (i - 1) * rowPerRead)
}
This works fine at the beginning, but it gets slower as skip grows, in a nonlinear fashion. It turns out that although those lines are skipped, fread still uses a lot of memory for them during the call, and that memory is only released when the call finishes. Once memory is used up, the process becomes very slow.
> system.time({newLines=fread("userinfo4.csv",nrows=1e6,skip=1,quote="") })
user system elapsed
0.71 0.04 0.73
> system.time({newLines=fread("userinfo4.csv",nrows=1e6,skip=1e8,quote="") })
Read 1000000 rows and 12 (of 12) columns from 20.049 GB file in 00:01:47
user system elapsed
21.89 13.76 106.60
> system.time({newLines=fread("userinfo4.csv",nrows=1e6,skip=1.4e8,quote="") })
Read 1000000 rows and 12 (of 12) columns from 20.049 GB file in 00:02:48
user system elapsed
16.95 12.49 169.76
>
(Screenshot of the memory usage during the 2nd and 3rd runs omitted.)
So my questions are:
1. Is there a more memory-efficient way to run fread with a large skip?
2. Is there a way to run fread from a file connection, so I can continue from the last read instead of restarting from the beginning?
You can use fread's ability to accept a shell command that preprocesses the file as its input. With this option we can run a gawk script to extract only the required lines. Note that you may need to install gawk if it is not already on your system (Linux and other Unix-like machines usually have it already; on Windows you may need to install it).
n = 100  # number of lines to skip
cmd = paste0('gawk "NR > ', n, '" ', filename)  # print only the lines after line n
lines = fread(cmd = cmd, nrows = rowPerRead)
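The same idea extends to reading the file in consecutive chunks: each pass asks gawk for one block of line numbers, so fread itself never skips anything and never buffers the earlier part of the file. Here is a rough sketch, assuming the filename and rowPerRead placeholders from the question, a single header line, and a hypothetical nChunks upper bound:

library(data.table)

rowPerRead = 1e6   # rows per chunk (placeholder, as in the question)
nChunks = 200      # generous upper bound on the number of chunks (assumption)

for (i in 1:nChunks) {
  from = (i - 1) * rowPerRead + 1   # first data row of this chunk (header excluded)
  to = i * rowPerRead               # last data row of this chunk
  # NR == 1 keeps the header on every pass so fread can name the columns;
  # the data rows sit on file lines from+1 .. to+1 because of the header line.
  # format(..., scientific = FALSE) stops R from writing 1e+06-style numbers into the command.
  cmd = paste0('gawk "NR == 1 || (NR > ', format(from, scientific = FALSE),
               ' && NR <= ', format(to + 1, scientific = FALSE), ')" ', filename)
  chunk = fread(cmd = cmd)
  # ... process `chunk` here before it is overwritten on the next iteration ...
  if (nrow(chunk) < rowPerRead) break   # last (possibly partial) chunk reached
}

gawk still has to scan past the earlier lines on each pass, so later chunks take longer to start, but the skipped lines are streamed and discarded rather than accumulated, which sidesteps the memory buildup described in the question. If the extra scan past the end of the chunk bothers you, an early-exit rule such as NR > to + 1 { exit } can be added to the gawk program so it stops reading once the chunk is complete.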