I need to read lines in small batches (say 100 at a time) from a gzip file which is a text file that has been compressed using gzip. I use small batches because each line is extremely long.
However, I am unable to do that with something like this (I think the buffer is not updated):
in.con <- gzfile("somefile.txt.gz")
for (i in 1:100000) {
chunk <- readLines(in.con, n = 100)
# if you inspect chunk in each loop step, say with a print
# you will find that chunk updates once or twice and then
# keeps printing the same data.
}
close(in.con)
How do I accomplish something similar?
NOTES: The same batched-read approach works as expected on an ordinary, uncompressed file:
in.con <- file("some.file.txt", "r", blocking = FALSE)
while(TRUE) {
chunk <- readLines(in.con, n = 2)
if (length(chunk) == 0) break
print(chunk)
}
close(in.con)
resulting in the output:
[1] "1" "2"
[1] "3" "4"
[1] "5" "6"
[1] "7" "8"
[1] "9" "10"
My version information is:
platform x86_64-apple-darwin15.6.0
arch x86_64
os darwin15.6.0
system x86_64, darwin15.6.0
status
major 3
minor 4.1
year 2017
month 06
day 30
svn rev 72865
language R
version.string R version 3.4.1 (2017-06-30)
nickname Single Candle
This is a bug in gzfile(). For large files, if no open parameter is specified, it will read the same line over and over again:
> incon <- gzfile(zfile)
> readLines(incon, 1)
[1] "First line"
> readLines(incon, 1)
[1] "First line"
Even with the open parameter specified, it just causes an error instead:
> incon <- gzfile(zfile,open="r")
> line <- readLines(incon,1)
Warning message:
In readLines(incon, 1) :
seek on a gzfile connection returned an internal error
Solution: As a workaround, one can instead use a regular file() connection in binary read mode and wrap it inside gzcon():
> incon <- gzcon(file(zfile,open="rb"))
> readLines(incon,1)
[1] First line
> readLines(incon,1)
[1] Second line
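Putting it together, here is a minimal sketch of the batched loop from the question using this workaround (the file name somefile.txt.gz and the batch size of 100 are carried over from the question; the print() call is a placeholder for whatever processing each batch needs):

in.con <- gzcon(file("somefile.txt.gz", open = "rb"))
repeat {
  chunk <- readLines(in.con, n = 100)  # read the next batch of up to 100 lines
  if (length(chunk) == 0) break        # readLines() returns character(0) at EOF
  print(chunk)                         # placeholder: process the batch here
}
close(in.con)

Because gzcon() decompresses the underlying stream on the fly, the connection is read strictly forward and never needs to seek, which is what sidesteps the problem above.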