
Open-Source compression algorithm with Checkpoints [closed]

I'm working in C++ with gcc 4.5.0 and MSVC 8/9.

I'd like to be able to compress a file (10 GB), then open this file using my application.

However, the file contents are such that I don't necessarily need everything in the file every time I use it.

So, for example, I might open one of these compressed files and want to seek 95% of the way through it without reading everything that comes before. With a stream compressor like gzip, this is not possible: I must decompress the first 95% of the file before I can decompress the last 5%.

So, are there any libraries similar to gzip, open source and available for commercial use, that have built-in checkpoints for re-syncing the decompression stream?

I have thought that perhaps a lossless audio codec might do the trick. I know that some of these algorithms have checkpoints so you can seek through a music file without waiting for its full contents to be decompressed. Are there pitfalls in using an audio codec for general data compression and decompression?

Thanks!

asked Apr 05 '11 by J T


4 Answers

bzip2 is free and open-source, and has readily available library implementations. It's block based, so you can decompress only the parts you need. If you need to seek to a particular location in the decompressed file, though, you might need to build a simple index over all the bzip2 blocks so you can determine which one contains the address you need.

gzip, although stream based, can be reset on arbitrary block boundaries. The concatenation of any number of gzip streams is itself a valid gzip stream, so you could easily operate gzip in a block-compression mode without breaking compatibility with existing decompressors, too.
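
For illustration, a minimal C++ sketch of the block-per-member idea described above, assuming zlib is used with gzip framing (windowBits = 15 + 16) and each block is finished independently. The BlockIndexEntry struct, buffer sizes, and function name are illustrative, not part of zlib, and error handling is omitted:

    #include <zlib.h>
    #include <cstdio>
    #include <cstring>
    #include <vector>

    struct BlockIndexEntry {
        unsigned long long uncompressed_offset;  // where this block starts in the original file
        unsigned long long compressed_offset;    // where its gzip member starts in the output
    };

    // Compress `in` to `out` in 1 MiB blocks, one independent gzip member per block,
    // and return an index recording where each member begins.
    std::vector<BlockIndexEntry> compress_with_index(std::FILE* in, std::FILE* out) {
        const size_t kBlockSize = 1 << 20;
        std::vector<unsigned char> inbuf(kBlockSize);
        std::vector<unsigned char> outbuf(compressBound(kBlockSize) + 32);
        std::vector<BlockIndexEntry> index;
        unsigned long long in_off = 0, out_off = 0;

        size_t n;
        while ((n = std::fread(&inbuf[0], 1, kBlockSize, in)) > 0) {
            z_stream zs;
            std::memset(&zs, 0, sizeof(zs));
            // windowBits = 15 + 16 asks zlib to wrap each block in gzip framing.
            deflateInit2(&zs, Z_DEFAULT_COMPRESSION, Z_DEFLATED, 15 + 16, 8, Z_DEFAULT_STRATEGY);
            zs.next_in   = &inbuf[0];
            zs.avail_in  = static_cast<uInt>(n);
            zs.next_out  = &outbuf[0];
            zs.avail_out = static_cast<uInt>(outbuf.size());
            deflate(&zs, Z_FINISH);   // finish the member; no history carries into the next block
            size_t produced = outbuf.size() - zs.avail_out;
            deflateEnd(&zs);

            BlockIndexEntry entry = { in_off, out_off };
            index.push_back(entry);
            std::fwrite(&outbuf[0], 1, produced, out);
            in_off  += n;
            out_off += produced;
        }
        return index;  // persist this alongside the compressed file for seeking later
    }

The resulting output is still a plain gzip file that ordinary decompressors can read end to end, while your own application can use the index to jump straight to a block.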

answered by Nick Johnson


I'm not sure about open-source, but there have been/are a fair number of programs that do this. If the input is static, it's pretty trivial -- pick a fixed block size, and re-start the compressor after compressing that much input data.

If the content is dynamic, things get a bit uglier, because when you change the contents of a block of input, that will typically change its size. There are at least two ways to deal with this: one is to insert a small amount of padding between blocks, so small changes can be accommodated in place (e.g., what started as a 64K block of input gets rounded to an integral number of 512-byte compressed blocks). The second is to create an index that maps from compressed blocks to decompressed blocks. I'm pretty sure a practical solution will normally use both.
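
As a sketch of the index idea, assuming the compressor recorded one (uncompressed offset, compressed offset) pair per block as in the earlier compression example, the lookup side could look roughly like this; the entry type and function name are hypothetical:

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    struct BlockIndexEntry {
        unsigned long long uncompressed_offset;  // block start in the original data
        unsigned long long compressed_offset;    // block start in the compressed file
    };

    // Comparator for std::upper_bound: "does the target offset come before this entry?"
    static bool offset_before_entry(unsigned long long target, const BlockIndexEntry& e) {
        return target < e.uncompressed_offset;
    }

    // Returns the position in `index` of the block whose uncompressed range contains
    // `target`. Assumes the index is sorted and its first entry starts at offset 0.
    std::size_t find_block(const std::vector<BlockIndexEntry>& index, unsigned long long target) {
        std::vector<BlockIndexEntry>::const_iterator it =
            std::upper_bound(index.begin(), index.end(), target, offset_before_entry);
        return static_cast<std::size_t>(it - index.begin()) - 1;
    }

    // Usage sketch: seek the compressed file to index[find_block(index, pos)].compressed_offset,
    // decompress just that one block, then skip (pos - uncompressed_offset) bytes within it.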

answered by Jerry Coffin


A simple approach would be to slice your uncompressed content into "blocks" and compress each one independently. They won't compress as well overall (since you aren't "sharing" context between blocks), but you can decompress each block independently.

"Key frames" in compressed video is sort of a special-case of this general approach.

answered by Laurence Gonsalves


http://sourceforge.net/projects/gzx

answered by xfe