This question on archiving PDF's got me wondering -- if I wanted to compress (for archival purposes) lots of files which are essentially small changes made on top of a master template (a letterhead), it seems like huge compression gains can be had with inter-file compression.
Do any of the standard compression/archiving formats support this? AFAIK, all the popular formats focus on compressing each single file.
Several formats do inter-file compression.
The oldest example is .tar.gz; a .tar has no compression but concatenates all the files together, with headers before each file, and a .gz can compress only one file. Both are applied in sequence, and it's a traditional format in the Unix world. .tar.bz2 is the same, only with bzip2 instead of gzip.
More recent examples are formats with optional "solid" compression (for instance, RAR and 7-Zip), which can internally concatenate all the files before compressing, if enabled by a command-line flag or GUI option.
Take a look at google's open-vcdiff.
http://code.google.com/p/open-vcdiff/
It is designed for calculating small compressed deltas and implements RFC 3284.
http://www.ietf.org/rfc/rfc3284.txt
Microsoft has an API for doing something similar, sans any semblance of a standard.
In general the algorithms you are looking for are ones based on Bentley/McIlroy:
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.11.8470
In particular these algorithms will be a win if the size of the template is larger than the window size (~32k) used by gzip or the block size (100-900k) used by bzip2.
They are used by Google internally inside of their BIGTABLE implementation to store compressed web pages for much the same reason you are seeking them.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With