Read binary files in R from a zipped file and a known starting position (byte offset)

Tags: r, binary, rcpp

I am trying to read a zipped binary file with R on Windows. So far, this works using unz() in combination with readBin():

> bin.con <- unz(zip_path, file_in_zip, open = 'rb')
> readBin(bin.con,
          "double", 
          n = byte_chunk, 
          size = 8L, 
          endian = "little")
> close(bin.con)

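Here zip_path is the path to the zip file, file_in_zip is the name of the file within the archive, and byte_chunk is the amount of data to read per call. Note that readBin()'s n argument counts elements of size bytes each (here 8-byte doubles), not individual bytes.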

In my use case, the readBin operation is part of a loop and gradually reads the whole binary file. However, I rarely want to read everything and often I know precisely which parts I want to read. Unfortunately, readBin doesn't have a start/skip argument to skip the first n bytes. Therefore I tried to conditionally replace readBin() with seek() in order to skip the actual reading of the unwanted parts.

When I try this, I get an error:

> bin.con <- unz(zip_path, file_in_zip, open = 'rb')
> seek(bin.con, where = bytes_to_skip, origin = 'current')
Error in seek.connection(bin.con, where = bytes_to_skip, origin = "current") : 
  seek not enabled for this connection
> close(bin.con)

So far, I haven't found a way around this error. Similar questions have been asked before, unfortunately without a satisfactory answer:

  • https://stat.ethz.ch/pipermail/r-help/2007-December/148847.html (no answer)
  • http://r.789695.n4.nabble.com/reading-file-in-zip-archive-td4631853.html (no answer but reproducible example)

Tips all over the internet suggest adding the open = 'r' argument to unz() or dropping the open argument altogether, but that only works for non-binary files (since the default is 'r'). People also suggest unzipping the files first, but since the files are quite big, this is impractical.

Is there any work-around to seek in a binary zipped file or read with a byte offset (potentially using C++ via the Rcpp package)?

Update:

Further research seems to indicate that seek() in zip files is not an easy problem. This question suggests a C++ library that can at best do a coarse seek. This Python question indicates that an exact seek is impossible because of the way zip is implemented (although it doesn't contradict the coarse seek method).

takje asked Jan 30 '17 12:01


1 Answer

Here's a bit of a hack that might work for you. First, a fake binary file:

writeBin(as.raw(1:255), "file.bin")
readBin("file.bin", raw(1), n = 16)
#  [1] 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f 10

And here's the resulting zip file:

zip("file.zip", "file.bin")
#   adding: file.bin (stored 0%)
readBin("file.zip", raw(1), n = 16)
#  [1] 50 4b 03 04 0a 00 02 00 00 00 7b ab 45 4a 87 1f

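The extraction below streams the zipped file to stdout with unzip -p and pipes it through dd, which skips the first 5 bytes and writes the next 4 to a temporary intermediate binary file: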

system('sh -c "unzip -p file.zip file.bin | dd of=tempfile.bin bs=1c skip=5c count=4c"')
# 4+0 records in
# 4+0 records out
# 4 bytes copied, 0.00044964 s, 8.9 kB/s
file.info("tempfile.bin")$size
# [1] 4
readBin("tempfile.bin", raw(1), n = 16)
# [1] 06 07 08 09
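
To reuse the same pipeline for arbitrary offsets, the command string can be built from R. This is only a sketch: build_extract_cmd() and its argument names are hypothetical, and it assumes sh, unzip and dd are on the PATH and that the paths contain no double quotes or dollar signs (they are embedded inside the double-quoted sh -c string).

```r
# Hypothetical helper (not part of the original answer): build the shell
# command that copies `count` bytes, starting at byte offset `skip`, from a
# file inside a zip archive into `out_path`.
build_extract_cmd <- function(zip_path, member, out_path, skip, count) {
  pipeline <- sprintf(
    "unzip -p %s %s | dd of=%s bs=1 skip=%.0f count=%.0f",
    shQuote(zip_path, type = "sh"),
    shQuote(member, type = "sh"),
    shQuote(out_path, type = "sh"),
    skip, count  # "%.0f" avoids integer overflow for offsets above 2^31 - 1
  )
  # Wrap in sh -c so the pipe is interpreted by sh rather than by R or cmd.exe.
  sprintf('sh -c "%s"', pipeline)
}

cmd <- build_extract_cmd("file.zip", "file.bin", "tempfile.bin",
                         skip = 5, count = 4)
cat(cmd, "\n")
# system(cmd)  # uncomment to run; assumes sh, unzip and dd are installed
```

With bs=1 the skip and count values are plain byte counts, at the price of dd copying one byte at a time, which can be slow for large ranges.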

This method offloads the "expense" of dealing with the size of the stored binary data to the shell pipeline, outside of R; R itself only reads the small temporary file.

The unzip/dd pipeline worked for me on Windows 10 with R 3.3.2. I'm using dd from Git for Windows (version 2.11.0.3, though 2.11.1 is available), and unzip and sh from RTools:

Sys.which(c("dd", "unzip", "sh"))
#                                    dd 
# "C:\\PROGRA~1\\Git\\usr\\bin\\dd.exe" 
#                                 unzip 
#          "c:\\Rtools\\bin\\unzip.exe" 
#                                    sh 
#             "c:\\Rtools\\bin\\sh.exe" 
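
If external tools are not an option, a pure-R fallback may be worth trying: a connection that cannot seek can still be read, so a forward skip can be emulated by reading raw bytes and throwing them away. The sketch below uses a hypothetical helper, skip_bytes(), and demonstrates it on a gzfile() connection only because R can create one without an external zip program. The assumption (not verified here) is that a connection opened with unz(zip_path, file_in_zip, open = 'rb') can be passed in the same way.

```r
# Hypothetical helper: advance a connection by reading and discarding raw
# bytes in bounded chunks, so memory use stays small for large skips.
skip_bytes <- function(con, n_bytes, chunk_size = 65536L) {
  remaining <- n_bytes
  while (remaining > 0) {
    got <- length(readBin(con, what = "raw", n = min(remaining, chunk_size)))
    if (got == 0L) break  # end of stream reached before the skip completed
    remaining <- remaining - got
  }
  invisible(n_bytes - remaining)  # number of bytes actually skipped
}

# Demo data: 100 little-endian doubles written to a compressed file.
tmp <- tempfile(fileext = ".gz")
out <- gzfile(tmp, open = "wb")
writeBin(as.numeric(1:100), out, size = 8L, endian = "little")
close(out)

con <- gzfile(tmp, open = "rb")
skip_bytes(con, 10 * 8)  # skip the first 10 doubles (8 bytes each)
vals <- readBin(con, "double", n = 5L, size = 8L, endian = "little")
close(con)
print(vals)
```

The skipped bytes are still decompressed, so this is not a true seek. It only avoids converting the unwanted data to doubles and avoids the temporary file.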
r2evans answered Oct 22 '22 20:10