I have a very large file compressed with gzip sitting on disk. The production environment is "Cloud"-based, so the storage performance is terrible, but CPU is fine. Previously, our data processing pipeline began with `gzip -dc` streaming the data off the disk.
Now, in order to parallelise the work, I want to run multiple pipelines that each take a pair of byte offsets - start and end - and take that chunk of the file. With a plain file this could be achieved with `head` and `tail`, but I'm not sure how to do it efficiently with a compressed file; if I `gzip -dc` and pipe into `head`, the offset pairs that are toward the end of the file will involve wastefully seeking through the whole file as it's slowly decompressed.
So my question is really about the gzip algorithm - is it theoretically possible to seek to a byte offset in the underlying file or get an arbitrary chunk of it, without the full implications of decompressing the entire file up to that point? If not, how else might I efficiently partition a file for "random" access by multiple processes while minimising the I/O throughput overhead?
Yes, you can access a gzip file randomly by reading the entire thing sequentially once and building an index. See examples/zran.c in the zlib distribution.
If you are in control of creating the gzip file, then you can optimize the file for this purpose by building in random access entry points and constructing the index while compressing.
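As a rough illustration of building the index while compressing, here is a minimal Python sketch using the standard `zlib` module. It is not what `zran.c` does; it simply assumes you control compression and can afford a `Z_FULL_FLUSH` at every entry point. The function names `compress_with_index` and `read_range`, the 1 MiB spacing, and the in-memory index format are all assumptions made for this example.

```python
import bisect
import io
import zlib

def compress_with_index(data: bytes, span: int = 1 << 20):
    """Gzip `data`, adding an entry point every `span` uncompressed bytes.

    Returns (gzip_bytes, index) where index is a list of
    (uncompressed_offset, compressed_offset) pairs.
    """
    comp = zlib.compressobj(6, zlib.DEFLATED, 31)  # wbits=31 selects the gzip wrapper
    out = bytearray()
    index = []
    for pos in range(0, len(data), span):
        if pos:
            # Z_FULL_FLUSH byte-aligns the output and makes the next block
            # independent of everything compressed so far.
            out += comp.flush(zlib.Z_FULL_FLUSH)
            index.append((pos, len(out)))
        out += comp.compress(data[pos:pos + span])
    out += comp.flush()  # Z_FINISH: last block plus gzip trailer
    return bytes(out), index

def read_range(f, index, start: int, end: int, bufsize: int = 65536) -> bytes:
    """Return uncompressed bytes [start, end) from seekable binary file `f`."""
    starts = [u for u, _ in index]
    i = bisect.bisect_right(starts, start) - 1
    if i < 0:
        u_off, c_off, wbits = 0, 0, 31    # start of file: parse the gzip header
    else:
        u_off, c_off = index[i]
        wbits = -15                       # entry point: raw deflate, no header
    f.seek(c_off)
    d = zlib.decompressobj(wbits)
    out = bytearray()
    need = end - u_off
    while len(out) < need and not d.eof:
        buf = f.read(bufsize)
        if not buf:
            break
        out += d.decompress(buf)
    return bytes(out[start - u_off:end - u_off])

# Small demonstration on synthetic data (hypothetical record format).
data = b"".join(b"record %08d\n" % i for i in range(200_000))
gz, index = compress_with_index(data)
chunk = read_range(io.BytesIO(gz), index, 2_500_000, 2_500_100)
assert chunk == data[2_500_000:2_500_100]
```

Each worker then only reads the compressed bytes from its nearest entry point onward, at the cost of decompressing up to one `span` of data it does not need. In practice the index would be saved next to the gzip file so every process can load it.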
You can also create a gzip file with markers by using `Z_SYNC_FLUSH` followed by `Z_FULL_FLUSH` in zlib's `deflate()` to insert two markers and make the next block independent of the previous data. This will reduce the compression, but not by much if you don't do this too often. E.g. once every megabyte should have very little impact. Then you can search for a nine-byte marker (with a much less probable false positive than bzip2's six-byte marker): `00 00 ff ff 00 00 00 ff ff`.
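The marker approach can be sketched in Python as well, since the `zlib` module exposes both flush modes through `Compress.flush()`. This is only an illustration under stated assumptions: the helper names `compress_with_markers` and `inflate_from_marker` and the 1 MiB spacing are made up for the example, and it works on in-memory bytes rather than a file.

```python
import zlib

# The nine-byte marker quoted above: the empty stored block written by
# Z_SYNC_FLUSH (00 00 ff ff) followed by the one written by Z_FULL_FLUSH
# (00 00 00 ff ff).
MARKER = bytes.fromhex("0000ffff000000ffff")

def compress_with_markers(data: bytes, span: int = 1 << 20) -> bytes:
    """Gzip `data`, inserting a marker every `span` uncompressed bytes."""
    comp = zlib.compressobj(6, zlib.DEFLATED, 31)  # wbits=31 selects the gzip wrapper
    out = bytearray()
    for pos in range(0, len(data), span):
        if pos:
            out += comp.flush(zlib.Z_SYNC_FLUSH)
            out += comp.flush(zlib.Z_FULL_FLUSH)  # next block needs no history
        out += comp.compress(data[pos:pos + span])
    out += comp.flush()  # Z_FINISH
    return bytes(out)

def inflate_from_marker(gz: bytes, search_from: int = 0) -> bytes:
    """Find the next marker at or after `search_from` and decompress from there."""
    p = gz.find(MARKER, search_from)
    if p < 0:
        raise ValueError("no marker found")
    d = zlib.decompressobj(-15)  # raw deflate: we start past the gzip header
    return d.decompress(gz[p + len(MARKER):])
```

A worker assigned a range of the compressed file would scan forward to the first marker in its range and start inflating there. Unlike an index, the markers alone do not tell you the uncompressed offset, so this suits splitting work by compressed position rather than by exact uncompressed byte offsets.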