
fread takes a lot of memory when "skip" is large

Tags: r, csv, data.table

I have a large CSV file (20 GB, almost 200 million lines) which I cannot load into memory as a whole, so I want to load it piece by piece.

I didn't find a way to use a file connection with fread (as you can with readLines), so I tried to use "skip":

for (i in 1:100) {
  lines = fread(filename, nrows = rowPerRead, skip = (i - 1) * rowPerRead)
}

This works fine at the beginning, but it becomes slower as skip gets larger, in a nonlinear fashion. It turns out that although those lines are skipped, fread still uses a lot of memory during the process, and that memory is only released when the call finishes. Once the memory is used up, the process becomes very slow.

> system.time({newLines=fread("userinfo4.csv",nrows=1e6,skip=1,quote="") })
   user  system elapsed 
   0.71    0.04    0.73 
> system.time({newLines=fread("userinfo4.csv",nrows=1e6,skip=1e8,quote="") })
Read 1000000 rows and 12 (of 12) columns from 20.049 GB file in 00:01:47
   user  system elapsed 
  21.89   13.76  106.60 
> system.time({newLines=fread("userinfo4.csv",nrows=1e6,skip=1.4e8,quote="") })
Read 1000000 rows and 12 (of 12) columns from 20.049 GB file in 00:02:48
   user  system elapsed 
  16.95   12.49  169.76 
> 

Memory usage during the 2nd and 3rd runs: [image: memory usage plot]

So my questions are:
1. Is there a more memory-efficient way to run fread with a large skip?
2. Is there a way to run fread from a file connection, so that I can continue from the last read instead of restarting from the beginning?

Yuan Ren asked Jan 02 '18 18:01


1 Answer

You can use fread's ability to accept a shell command that preprocesses the file as its input. With this option we can run a gawk script to extract the required lines, so the skipped lines never reach fread. Note that you may need to install gawk if it is not already on your system (Linux and Unix-like machines usually have it; on Windows you may need to install it).

n = 100   # lines to skip
cmd = paste0('gawk "NR > ', n, '" ', filename)
lines = fread(cmd, nrows = rowPerRead)
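The same idea can be adapted to the chunked loop from the question. The following is only a sketch under some assumptions: `read_chunk` is a hypothetical helper (not part of data.table), the file has a single header line, gawk is on the PATH, and the installed data.table is recent enough to have the `cmd` argument of `fread` (older versions take the command as the first argument, as above). Unlike the one-liner above, it keeps the header line and tells gawk to stop once the chunk has been printed.

```r
library(data.table)

# Hypothetical helper: return data rows (skip + 1) .. (skip + n) of a CSV
# that has one header line, letting gawk discard everything else.
read_chunk <- function(path, skip, n, awk_bin = "gawk") {
  first <- skip + 1      # NR of the last line to drop (header is NR == 1)
  last  <- skip + n + 1  # NR of the last line to keep
  # Rule 1: stop reading as soon as we are past the chunk.
  # Rule 2: print the header and every line inside the chunk.
  prog <- sprintf("NR > %d { exit } NR == 1 || NR > %d { print }", last, first)
  cmd <- paste(awk_bin, shQuote(prog), shQuote(path))
  fread(cmd = cmd)
}

# Small self-contained demo on a temporary file.
tmp <- tempfile(fileext = ".csv")
fwrite(data.table(id = 1:10, x = letters[1:10]), tmp)
chunk <- read_chunk(tmp, skip = 4, n = 3)
print(chunk$id)  # 5 6 7
```

In the loop from the question, the call would become something like `read_chunk(filename, (i - 1) * rowPerRead, rowPerRead)`. gawk still scans the file from the top on every call, but it processes one line at a time, so memory use should stay flat as skip grows.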
dww answered Nov 09 '22 02:11