I have a zipped binary file under the Windows operating system that I am trying to read with R. So far it works using the unz() function in combination with the readBin() function.
> bin.con <- unz(zip_path, file_in_zip, open = 'rb')
> readBin(bin.con,
"double",
n = byte_chunk,
size = 8L,
endian = "little")
> close(bin.con)
Here zip_path is the path to the zip file, file_in_zip is the name of the file inside the zip that is to be read, and byte_chunk is the number of values to read per call (readBin's n counts records of size bytes each, here 8-byte doubles, rather than raw bytes).
In my use case, the readBin operation is part of a loop and gradually reads the whole binary file. However, I rarely want to read everything and often I know precisely which parts I want to read. Unfortunately, readBin doesn't have a start/skip argument to skip the first n bytes. Therefore I tried to conditionally replace readBin() with seek() in order to skip the actual reading of the unwanted parts.
When I try this, I get an error:
> bin.con <- unz(zip_path, file_in_zip, open = 'rb')
> seek(bin.con, where = bytes_to_skip, origin = 'current')
Error in seek.connection(bin.con, where = bytes_to_skip, origin = "current") :
seek not enabled for this connection
> close(bin.con)
So far, I haven't found a way to solve this error. Similar questions exist, unfortunately without a satisfactory answer.
Tips all over the internet suggest adding the open = 'r' argument to unz() or dropping the open argument altogether, but that only works for non-binary files (since the default is 'r'). People also suggest unzipping the files first, but since the files are quite big, this is practically impossible.
Is there any workaround to seek in a binary zipped file or read with a byte offset (potentially using C++ via the Rcpp package)?
Update:
Further research seems to indicate that seek() in zip files is not an easy problem. This question suggests a C++ library that can at best use a coarse seek. This Python question indicates that an exact seek is completely impossible because of the way zip is implemented (although it doesn't contradict the coarse seek method).
Here's a bit of a hack that might work for you. Here's a fake binary file:
writeBin(as.raw(1:255), "file.bin")
readBin("file.bin", raw(1), n = 16)
# [1] 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f 10
And here's the produced zip file:
zip("file.zip", "file.bin")
# adding: file.bin (stored 0%)
readBin("file.zip", raw(1), n = 16)
# [1] 50 4b 03 04 0a 00 02 00 00 00 7b ab 45 4a 87 1f
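The extraction below pipes the unzipped stream through dd, which skips the first 5 bytes and writes the next 4 to a temporary intermediate binary file.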
system('sh -c "unzip -p file.zip file.bin | dd of=tempfile.bin bs=1c skip=5c count=4c"')
# 4+0 records in
# 4+0 records out
# 4 bytes copied, 0.00044964 s, 8.9 kB/s
file.info("tempfile.bin")$size
# [1] 4
readBin("tempfile.bin", raw(1), n = 16)
# [1] 06 07 08 09
This method offloads the "expense" of handling the large stored binary data to the shell pipeline, outside of R.
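If the external tools are not available, a pure-R alternative is to read and discard the unwanted bytes in chunks, since sequential reads still work on an unz() connection even though seek() does not. This is only a minimal sketch and not a true seek: the skipped data is still decompressed, it is just never held in memory beyond one chunk. The names skip_bytes and read_doubles_at, and the chunk argument, are made up for illustration.

```r
# Advance a connection by n bytes by reading and discarding them in chunks.
# This does not rely on seek(), so it also works on unz() connections.
skip_bytes <- function(con, n, chunk = 1e6) {
  remaining <- n
  while (remaining > 0) {
    got <- length(readBin(con, what = "raw", n = min(chunk, remaining)))
    if (got == 0L) {
      stop("reached end of file before skipping ", n, " bytes")
    }
    remaining <- remaining - got
  }
  invisible(n)
}

# Hypothetical helper: read `n` little-endian doubles starting at byte `offset`
# of a file stored inside a zip archive.
read_doubles_at <- function(zip_path, file_in_zip, offset, n) {
  con <- unz(zip_path, file_in_zip, open = "rb")
  on.exit(close(con))
  skip_bytes(con, offset)
  readBin(con, what = "double", n = n, size = 8L, endian = "little")
}
```

For example, read_doubles_at(zip_path, file_in_zip, offset = 8 * 1000, n = 10) would skip the first 1000 doubles and return the next 10.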
This worked on Windows 10 with R 3.3.2. I'm using dd from Git for Windows (version 2.11.0.3, though 2.11.1 is available), and unzip and sh from RTools.
Sys.which(c("dd", "unzip", "sh"))
# dd
# "C:\\PROGRA~1\\Git\\usr\\bin\\dd.exe"
# unzip
# "c:\\Rtools\\bin\\unzip.exe"
# sh
# "c:\\Rtools\\bin\\sh.exe"
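Since the pipeline depends on three external programs, it may be worth checking that they can be found before calling system(). A small sketch; missing_tools is a hypothetical helper name, not part of base R.

```r
# Return the names of the requested programs that Sys.which() cannot find.
# Sys.which() returns an empty string for programs that are not on the PATH.
missing_tools <- function(tools) {
  paths <- Sys.which(tools)
  names(paths)[!nzchar(paths)]
}

needed <- c("dd", "unzip", "sh")
absent <- missing_tools(needed)
if (length(absent) > 0) {
  message("not found on PATH: ", paste(absent, collapse = ", "))
}
```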