
What compression/archive formats support inter-file compression?

This question on archiving PDFs got me wondering: if I wanted to compress (for archival purposes) many files that are essentially small changes made on top of a master template (such as a letterhead), it seems that large compression gains could be had from inter-file compression.

Do any of the standard compression/archiving formats support this? AFAIK, all the popular formats compress each file on its own.

Toybuilder asked Dec 23 '22 14:12


2 Answers

Several formats do inter-file compression.

The oldest example is .tar.gz. A .tar archive applies no compression; it simply concatenates all the files, with a header before each one. A .gz, in turn, can compress only a single file. Applying the two in sequence (tar first, then gzip) compresses the whole concatenated stream at once, and the combination is a traditional format in the Unix world. .tar.bz2 works the same way, with bzip2 in place of gzip.
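To make the difference concrete, here is a minimal Python sketch that compares gzipping each file on its own with building one .tar.gz in memory. The file names, the 8000-byte random "template", and the helper names are made up for illustration; the random bytes just stand in for shared content that does not compress well on its own.

```python
import gzip
import io
import random
import tarfile


def make_documents(count=20, template_size=8000, seed=0):
    """Build hypothetical files: one shared template plus a short unique tail."""
    rng = random.Random(seed)
    template = bytes(rng.getrandbits(8) for _ in range(template_size))
    return {
        f"doc{i:02d}.bin": template + f"letter body {i}".encode()
        for i in range(count)
    }


def per_file_gzip_size(docs):
    """Total size when every file is gzipped separately."""
    return sum(len(gzip.compress(data)) for data in docs.values())


def tar_gz_size(docs):
    """Size of a single .tar.gz holding all files (built in memory)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, data in docs.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return len(buf.getvalue())


docs = make_documents()
separate = per_file_gzip_size(docs)
combined = tar_gz_size(docs)
print("gzip per file:", separate, "bytes")
print("one .tar.gz:  ", combined, "bytes")
```

The template here is deliberately small, so each copy still sits close enough to the previous one in the tar stream for gzip to find it.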

More recent examples are formats with optional "solid" compression (for instance, RAR and 7-Zip), which can internally concatenate all the files before compressing, if enabled by a command-line flag or GUI option.
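As a rough illustration of what "solid" means, the sketch below concatenates all files, compresses them as one LZMA stream, and keeps a small offset index so a single file can be pulled back out. This is not the actual RAR or 7-Zip container layout; `pack_solid`, `extract`, and the sample data are hypothetical names and values chosen for the example.

```python
import lzma
import random


def pack_solid(files):
    """Concatenate all files, compress once, keep an (offset, length) index."""
    index, offset = {}, 0
    for name, data in files.items():
        index[name] = (offset, len(data))
        offset += len(data)
    blob = lzma.compress(b"".join(files.values()))
    return index, blob


def extract(index, blob, name):
    """Recover one file; note the whole stream is decompressed to reach it."""
    offset, length = index[name]
    return lzma.decompress(blob)[offset:offset + length]


# Hypothetical data: a 50,000-byte random template plus a short unique tail.
rng = random.Random(1)
template = bytes(rng.getrandbits(8) for _ in range(50_000))
files = {
    f"letter{i}.bin": template + f"Dear customer {i}".encode()
    for i in range(10)
}

index, solid_blob = pack_solid(files)
non_solid_total = sum(len(lzma.compress(data)) for data in files.values())
print("solid:    ", len(solid_blob), "bytes")
print("non-solid:", non_solid_total, "bytes")
```

The `extract` helper also shows the usual price of solid mode: getting at one file means decompressing data that belongs to others.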

CesarB answered Apr 29 '23 09:04


Take a look at Google's open-vcdiff.

http://code.google.com/p/open-vcdiff/

It is designed for calculating small compressed deltas and implements RFC 3284.

http://www.ietf.org/rfc/rfc3284.txt
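I don't have an open-vcdiff binding to show here, so as a stand-in the sketch below uses zlib's preset dictionary (the `zdict` argument in Python's standard library) to compress a document against its template. This is not VCDIFF and not open-vcdiff's API; it only illustrates the same idea of encoding a file as a small delta relative to a known base, and it only works for templates that fit within zlib's 32 KB window. The template contents and function names are assumptions made for the example.

```python
import random
import zlib


def compress_against(template, document):
    """Compress document using template as a preset dictionary."""
    comp = zlib.compressobj(level=9, zdict=template)
    return comp.compress(document) + comp.flush()


def decompress_against(template, delta):
    """Reverse of compress_against; the same template must be supplied."""
    decomp = zlib.decompressobj(zdict=template)
    return decomp.decompress(delta) + decomp.flush()


# Hypothetical 20,000-byte template (random bytes, so it does not
# compress on its own) and a document that inserts a short line into it.
rng = random.Random(3)
template = bytes(rng.getrandbits(8) for _ in range(20_000))
document = template[:10_000] + b"Dear Ms. Example, ..." + template[10_000:]

plain = zlib.compress(document, 9)
delta = compress_against(template, document)
print("without template:", len(plain), "bytes")
print("against template:", len(delta), "bytes")
```

The template has to be stored once somewhere, since every delta is useless without it.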

Microsoft has an API for doing something similar, sans any semblance of a standard.

In general, the algorithms you are looking for are those based on Bentley and McIlroy's work on compressing long common strings:

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.11.8470

In particular, these algorithms will be a win if the template is larger than the window size (about 32 KB) used by gzip or the block size (100 to 900 KB) used by bzip2.

Google uses them internally in its Bigtable implementation to store compressed web pages, for much the same reason you are looking for them.
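The window-size point can be checked with a small experiment. The sketch below compresses two back-to-back copies of a random template, once with zlib (the same DEFLATE algorithm gzip uses) and once with LZMA, whose default dictionary is much larger. LZMA is only a convenient stand-in for "a compressor that can see far back"; it is not the Bentley/McIlroy algorithm. The template sizes and the `compare` helper are arbitrary choices for illustration.

```python
import lzma
import random
import zlib


def compare(template_size, seed=2):
    """Return (raw, zlib, lzma) sizes for two copies of a random template."""
    rng = random.Random(seed)
    template = bytes(rng.getrandbits(8) for _ in range(template_size))
    data = template + template
    return len(data), len(zlib.compress(data, 9)), len(lzma.compress(data))


# Template smaller than the 32 KB window: zlib can reach the first copy.
small = compare(8_000)
# Template larger than the window: the first copy is out of zlib's reach.
large = compare(100_000)

for label, (raw, z, x) in (("8 KB template", small), ("100 KB template", large)):
    print(f"{label}: raw={raw} zlib={z} lzma={x}")
```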

Edward Kmett answered Apr 29 '23 09:04