Random access to gzipped files?

Tags:

unix

concurrency

gzip

streaming

I have a very large file compressed with gzip sitting on disk. The production environment is "Cloud"-based, so the storage performance is terrible, but CPU is fine. Previously, our data processing pipeline began with gzip -dc streaming the data off the disk.

Now, in order to parallelise the work, I want to run multiple pipelines that each take a pair of byte offsets - start and end - and take that chunk of the file. With a plain file this could be achieved with head and tail, but I'm not sure how to do it efficiently with a compressed file; if I gzip -dc and pipe into head, the offset pairs that are toward the end of the file will involve wastefully seeking through the whole file as it's slowly decompressed.

So my question is really about the gzip algorithm - is it theoretically possible to seek to a byte offset in the underlying file or get an arbitrary chunk of it, without the full implications of decompressing the entire file up to that point? If not, how else might I efficiently partition a file for "random" access by multiple processes while minimising the I/O throughput overhead?

404

asked Jan 08 '13 23:01

Cera

1 Answer

Yes, you can access a gzip file randomly by reading the entire thing sequentially once and building an index. See examples/zran.c in the zlib distribution.

If you are in control of creating the gzip file, then you can optimize the file for this purpose by building in random access entry points and constructing the index while compressing.
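As a rough illustration of building the index while compressing, here is a minimal Python sketch using the standard `zlib` module. The function names `compress_with_index` and `read_range`, the chunk size, and the use of in-memory bytes instead of file seeks are assumptions made for the example; they are not part of zlib or of zran.c. It uses `Z_FULL_FLUSH` to make each entry point byte-aligned and independent of earlier data, then records the (uncompressed offset, compressed offset) pair.

```python
import bisect
import zlib


def compress_with_index(data: bytes, chunk_size: int = 1 << 20):
    """Return (gzip_bytes, index); index holds (uncompressed_offset, compressed_offset)."""
    comp = zlib.compressobj(wbits=31)  # 31 selects the gzip wrapper
    out = bytearray()
    index = []
    for start in range(0, len(data), chunk_size):
        if start:
            # A full flush byte-aligns the output and resets the history,
            # so the next block can be decoded without any earlier data.
            out += comp.flush(zlib.Z_FULL_FLUSH)
            index.append((start, len(out)))
        out += comp.compress(data[start:start + chunk_size])
    out += comp.flush()  # Z_FINISH: final block plus gzip trailer
    return bytes(out), index


def read_range(gz: bytes, index, start: int, length: int) -> bytes:
    """Return `length` uncompressed bytes beginning at uncompressed offset `start`."""
    i = bisect.bisect_right(index, (start, float("inf"))) - 1
    if i < 0:
        # Before the first entry point: decode from the gzip header.
        u_off, c_off = 0, 0
        dec = zlib.decompressobj(wbits=31)
    else:
        # At an entry point there is no header, so decode raw deflate.
        u_off, c_off = index[i]
        dec = zlib.decompressobj(wbits=-15)
    skip = start - u_off
    chunk = dec.decompress(gz[c_off:], skip + length)
    return chunk[skip:skip + length]
```

With a real file, each worker would seek to the compressed offset from the index and read only from there, so the I/O cost per worker is roughly its own share of the compressed file plus at most one chunk of skipped data.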

You can also create a gzip file with markers by using Z_SYNC_FLUSH followed by Z_FULL_FLUSH in zlib's deflate() to insert two markers and make the next block independent of the previous data. This will reduce the compression, but not by much if you don't do this too often. E.g. once every megabyte should have very little impact. Then you can search the compressed data for a nine-byte marker (with a much lower probability of a false positive than bzip2's six-byte marker): 00 00 ff ff 00 00 00 ff ff. Decompression can start at the byte immediately after the marker.
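The marker approach can be sketched in Python as well, again with the standard `zlib` module. The helper names (`compress_with_markers`, `find_entry_points`, `decompress_from`) and the fixed chunk size are hypothetical choices for this example. Because the writer flushes after every `chunk_size` bytes of input, the n-th marker (counting from zero) is assumed to correspond to uncompressed offset `(n + 1) * chunk_size`.

```python
import zlib

# Empty stored block from Z_SYNC_FLUSH, then another from Z_FULL_FLUSH.
MARKER = bytes.fromhex("0000ffff000000ffff")


def compress_with_markers(data: bytes, chunk_size: int = 1 << 20) -> bytes:
    """Gzip `data`, inserting the nine-byte marker after every chunk."""
    comp = zlib.compressobj(wbits=31)  # 31 selects the gzip wrapper
    out = bytearray()
    for start in range(0, len(data), chunk_size):
        out += comp.compress(data[start:start + chunk_size])
        out += comp.flush(zlib.Z_SYNC_FLUSH)  # ends with 00 00 ff ff
        out += comp.flush(zlib.Z_FULL_FLUSH)  # adds 00 00 00 ff ff, resets history
    out += comp.flush()  # Z_FINISH
    return bytes(out)


def find_entry_points(gz: bytes):
    """Return compressed offsets of the bytes that follow each marker."""
    entries = []
    pos = gz.find(MARKER)
    while pos != -1:
        entries.append(pos + len(MARKER))
        pos = gz.find(MARKER, pos + 1)
    return entries


def decompress_from(gz: bytes, entry: int, max_length: int = 0) -> bytes:
    """Decode raw deflate data starting at an entry point (0 means no limit)."""
    return zlib.decompressobj(wbits=-15).decompress(gz[entry:], max_length)
```

The file remains an ordinary gzip stream, so `gzip -dc` still works on it unchanged. A worker that is handed a compressed byte range can scan forward to the first marker in its range and start decoding there.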

157

answered Sep 21 '22 06:09

Mark Adler
