
How would one readLines from a gzip file in R?

Tags:

r

I need to read lines in small batches (say 100 at a time) from a text file that has been compressed with gzip. I use small batches because each line is extremely long.

However, I am unable to do that with something like this (I think the buffer is not updated):

in.con <- gzfile("somefile.txt.gz")
for (i in 1:100000) {
  chunk <- readLines(in.con,n = 100)
  # if you inspect chunk in each loop step, say with a print
  # you will find that chunk updates once or twice and then
  # keeps printing the same data.
}
close(in.con)

How do I accomplish something similar?

NOTES:

  1. For small files this will work.
  2. You will need a very large file to reproduce this; when you read from it repeatedly, you will see that the chunk variable stops updating.
  3. I think this is because the underlying scan is not reliable on a gzip file.
  4. The i variable only limits the number of loop iterations; it does not need to be referenced.
  5. Some comments seem to be saying that the code will not work with a text file -- I'm posting results that show otherwise:


in.con <- file("some.file.txt", "r", blocking = FALSE)
while(TRUE) {
  chunk <- readLines(in.con,n = 2)
  if (length(chunk)==0) break;
  print(chunk)
}
close(in.con)

resulting in the output:

[1] "1" "2"
[1] "3" "4"
[1] "5" "6"
[1] "7" "8"
[1] "9"  "10"

My version information is:

platform       x86_64-apple-darwin15.6.0   
arch           x86_64                      
os             darwin15.6.0                
system         x86_64, darwin15.6.0        
status                                     
major          3                           
minor          4.1                         
year           2017                        
month          06                          
day            30                          
svn rev        72865                       
language       R                           
version.string R version 3.4.1 (2017-06-30)
nickname       Single Candle     
user1172468 asked Aug 13 '17 22:08


1 Answer

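This appears to be a bug in gzfile(). For large files, if no open parameter is specified, readLines() returns the same line over and over again. (readLines() is documented to open an unopened connection only for the duration of the call and to close it afterwards, which would make every call start again from the top of the file.)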

> incon <- gzfile(zfile)
> readLines(incon,1)
[1] First line
> readLines(incon,1)
[1] First line

Specifying the open parameter does not help; it just produces a warning instead.

> incon <- gzfile(zfile,open="r")
> line <- readLines(incon,1)
Warning message:
In readLines(incon, 1) :
  seek on a gzfile connection returned an internal error

Solution: As a workaround, one can instead use a regular file() connection in binary read mode and wrap it inside a gzcon():

> incon <- gzcon(file(zfile,open="rb"))
> readLines(incon,1)
[1] First line
> readLines(incon,1)
[1] Second line
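
Applied to the batch reading from the question, the same workaround might look like the sketch below. The names read_gz_batches and handle_batch are hypothetical and only for illustration; the loop assumes that an empty result from readLines() means the end of the file, as in the plain-text example in the question. The small demo file at the end is written with gzfile(open = "w") purely so the sketch can be run on its own.

```r
# Sketch: read a gzip-compressed text file in fixed-size batches.
# read_gz_batches and handle_batch are hypothetical names, not part of R.
read_gz_batches <- function(path, batch_size = 100, handle_batch = print) {
  # Wrap a binary file connection in gzcon(), as in the workaround above.
  con <- gzcon(file(path, open = "rb"))
  # Closing the gzcon wrapper also closes the underlying file connection.
  on.exit(close(con))

  total <- 0L
  repeat {
    chunk <- readLines(con, n = batch_size)
    if (length(chunk) == 0) break   # nothing left to read
    handle_batch(chunk)
    total <- total + length(chunk)
  }
  invisible(total)                  # number of lines read
}

# Demo on a small temporary file (250 short lines).
demo_path <- tempfile(fileext = ".txt.gz")
demo_out <- gzfile(demo_path, open = "w")
writeLines(as.character(1:250), demo_out)
close(demo_out)

n_lines <- read_gz_batches(
  demo_path,
  batch_size = 100,
  handle_batch = function(x) cat("got", length(x), "lines\n")
)
```

Because the connection is opened once and stays open until the function exits, each readLines() call continues from where the previous one stopped.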
jweile answered Sep 26 '22 13:09