Compression formats with good support for random access within archives?

This is similar to a previous question, but the answers there don't satisfy my needs and my question is slightly different:

I currently use gzip compression for some very large files which contain sorted data. When the files are not compressed, binary search is a handy and efficient way to support seeking to a location in the sorted data.

But when the files are compressed, things get tricky. I recently found out about zlib's Z_FULL_FLUSH option, which can be used during compression to insert "sync points" in the compressed output (inflateSync() can then begin reading from various points in the file). This is OK, though files I already have would have to be recompressed to add this feature (and strangely gzip doesn't have an option for this, but I'm willing to write my own compression program if I must).
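For illustration, here is a minimal sketch of such a recompression program using Python's zlib bindings. It writes a raw deflate stream (no gzip header/trailer) and keeps the sync-point offsets in memory; the 10 MB interval and the file-handling are arbitrary simplifications:

```python
import zlib

CHUNK = 10 * 1024 * 1024  # sync-point interval: 10 MB of input (arbitrary)

def compress_with_sync_points(src_path, dst_path):
    """Compress src_path into a raw deflate stream, emitting a
    Z_FULL_FLUSH sync point after every CHUNK bytes of input so that
    decompression can restart at each recorded offset."""
    comp = zlib.compressobj(9, zlib.DEFLATED, -15)  # -15 = raw deflate, no gzip framing
    offsets = [0]   # compressed offsets where decompression may restart
    written = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            data = src.read(CHUNK)
            if not data:
                break
            out = comp.compress(data) + comp.flush(zlib.Z_FULL_FLUSH)
            dst.write(out)
            written += len(out)
            offsets.append(written)  # the next independent block starts here
        dst.write(comp.flush(zlib.Z_FINISH))
    return offsets
```

To read from any recorded offset, seek there and feed the remainder of the file to zlib.decompressobj(-15); a full flush resets the compressor's history, so inflation can start cold at each sync point.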

It seems from one source that even Z_FULL_FLUSH is not a perfect solution: not only is it not present in all gzip archives, but the very idea of detecting sync points in archives may produce false positives (either by coincidental matches with the sync-point magic bytes, or because Z_SYNC_FLUSH also produces sync points, which are not usable for random access).

Is there a better solution? I'd like to avoid having auxiliary index files if possible, and explicit, default support for quasi-random access would be helpful (even if it's coarse-grained, like being able to start reading at every 10 MB interval). Is there another compression format with better support for random reads than gzip?

Edit: As I mentioned, I wish to do binary search in the compressed data. I don't need to seek to a specific (uncompressed) position, only to seek with some coarse granularity within the compressed file. I just want support for something like "Decompress the data starting roughly 50% (25%, 12.5%, etc.) of the way into this compressed file."
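To make that concrete, here is a hedged sketch of the "start at roughly 50%" idea against a flush-marked deflate stream: seek to a fraction of the compressed file, scan forward for the empty-stored-block marker a flush emits, and try inflating from there. Note this scan has exactly the false-positive problem described above, since Z_SYNC_FLUSH emits the same bytes:

```python
import zlib

SYNC = b"\x00\x00\xff\xff"  # empty stored block emitted by a zlib flush

def read_from_fraction(path, fraction):
    """Seek to roughly `fraction` of the compressed file, scan forward
    for a flush marker, and try to inflate from there.  The marker may
    be a Z_SYNC_FLUSH point or a coincidental byte pattern, so callers
    must tolerate zlib.error (or garbage output) and retry further on."""
    with open(path, "rb") as f:
        data = f.read()  # fine for a sketch; use mmap for very large files
    pos = data.find(SYNC, int(len(data) * fraction))
    if pos < 0:
        raise ValueError("no sync marker found")
    d = zlib.decompressobj(-15)  # raw deflate; history is reset at a full flush
    return d.decompress(data[pos + len(SYNC):], 1 << 20)  # first ~1 MB
```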

John Zwinck, asked Jan 09 '09



2 Answers

Take a look at dictzip. It is compatible with gzip and allows coarse random access.

An excerpt from its man page:

dictzip compresses files using the gzip(1) algorithm (LZ77) in a manner which is completely compatible with the gzip file format. An extension to the gzip file format (Extra Field, described in 2.3.1.1 of RFC 1952) allows extra data to be stored in the header of a compressed file. Programs like gzip and zcat will ignore this extra data. However, [dictzcat --start] will make use of this data to perform pseudo-random access on the file.

The dictzip package is available in Ubuntu, and its source code ships in the dictd-*.tar.gz tarball. It is GPL-licensed, so you are free to study it.
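For illustration, here is a hedged sketch of reading that chunk index back out of the gzip header in Python. The "RA" subfield layout assumed here (version, chunk length, chunk count, then 16-bit compressed chunk sizes) is my reading of the dictzip sources; verify it against the real code before relying on it:

```python
import struct

def read_dictzip_index(path):
    """Parse the 'RA' extra field from a dictzip file's gzip header.
    Layout assumed from the dictzip sources: VER, CHLEN (uncompressed
    bytes per chunk), CHCNT, then CHCNT 16-bit compressed chunk sizes."""
    with open(path, "rb") as f:
        header = f.read(12)
        if header[:2] != b"\x1f\x8b":
            raise ValueError("not a gzip file")
        if not header[3] & 0x04:  # FEXTRA flag not set: plain gzip, no index
            raise ValueError("no extra field; not a dictzip file")
        (xlen,) = struct.unpack("<H", header[10:12])
        extra = f.read(xlen)
    pos = 0
    while pos + 4 <= len(extra):
        si, (length,) = extra[pos:pos + 2], struct.unpack("<H", extra[pos + 2:pos + 4])
        data = extra[pos + 4:pos + 4 + length]
        if si == b"RA":
            ver, chlen, chcnt = struct.unpack("<3H", data[:6])
            sizes = struct.unpack("<%dH" % chcnt, data[6:6 + 2 * chcnt])
            return chlen, list(sizes)
        pos += 4 + length
    raise ValueError("no 'RA' subfield found")
```

Given chlen and the chunk-size table, uncompressed offset N lives in chunk N // chlen, and its compressed start is the header length plus sum(sizes[:N // chlen]); dictzip flushes between chunks precisely so each one can be inflated independently.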

Update:

I improved dictzip to have no file size limit. My implementation is under the MIT license.

Ivo Danihelka, answered Nov 10 '22


I don't know of any compressed file format which would support random access to a specific location in the uncompressed data (well, except for multimedia formats), but you can brew your own.

For example, bzip2 compressed files are composed of independently compressed blocks (each under 1 MB uncompressed) delimited by a magic byte sequence, so you could parse the bzip2 file, find the block boundaries, and then decompress just the right block. This would need some indexing to remember where the blocks start; a sketch of the scan follows below.
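One caveat: bzip2 block headers are bit-aligned, not byte-aligned, so the scan has to slide a window over individual bits. Here is a hedged Python sketch (slow, and coincidental matches inside compressed data are possible, much like the gzip sync-point false positives above):

```python
BLOCK_MAGIC = 0x314159265359  # 48-bit bzip2 block-header magic (digits of pi)
MASK = (1 << 48) - 1

def find_block_bit_offsets(path):
    """Scan a .bz2 file bit by bit for block-header magics.
    Returns candidate bit offsets of block starts; treat them as
    candidates, not ground truth, and validate by decompressing."""
    offsets = []
    window = 0
    bitpos = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(1 << 16)
            if not chunk:
                break
            for byte in chunk:
                for shift in (7, 6, 5, 4, 3, 2, 1, 0):
                    window = ((window << 1) | ((byte >> shift) & 1)) & MASK
                    bitpos += 1
                    if bitpos >= 48 and window == BLOCK_MAGIC:
                        offsets.append(bitpos - 48)  # bit offset of header start
    return offsets
```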

Still, I think the best solution would be to split your file into chunks of a size you choose and then compress each chunk with an archiver such as zip or rar, which supports random access to the individual files in the archive.
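As a sketch of that approach with Python's standard zipfile module (the 10 MB chunk size and member names are arbitrary choices):

```python
import zipfile

CHUNK = 10 * 1024 * 1024  # arbitrary 10 MB chunks

def split_and_zip(src_path, zip_path):
    """Store a large file as numbered chunk members in a zip archive;
    each member is deflated independently, so any one of them can be
    decompressed without touching the others."""
    with open(src_path, "rb") as src, \
         zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        i = 0
        while True:
            data = src.read(CHUNK)
            if not data:
                break
            zf.writestr("chunk%06d" % i, data)
            i += 1

def read_chunk(zip_path, i):
    """Random access: inflate only the i-th chunk."""
    with zipfile.ZipFile(zip_path) as zf:
        return zf.read("chunk%06d" % i)
```

Binary search then works at chunk granularity: inflate a single member, inspect its first sorted key, and recurse left or right.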

jpalecek, answered Nov 10 '22