I'm writing a compression library as a little side project, and I'm far enough along (my library can extract any standard gzip file and produce compliant, though certainly not yet optimal, gzip output) that it's time to figure out a meaningful block termination strategy. Currently, I just cut the blocks off after every 32k of input (the LZ77 window size) because it was convenient and quick to implement -- now I am going back and trying to actually improve compression efficiency.
The Deflate spec has only this to say about it: "The compressor terminates a block when it determines that starting a new block with fresh trees would be useful, or when the block size fills up the compressor's block buffer", which isn't all that helpful.
I sorted through the SharpZipLib code (as I figured it would be the most easily readable open source implementation), and found that it terminates a block every 16k literals of output, ignoring the input. This is easy enough to implement, but it seems like there must be some more targeted approach, especially given the language in the spec "determines that starting a new block with fresh trees would be useful".
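That fixed-count policy boils down to something like the following sketch (Python, with hypothetical names; emit_block stands in for whatever builds the Huffman trees and writes out one block):

MAX_SYMBOLS_PER_BLOCK = 16 * 1024

class BlockBuilder:
    def __init__(self, emit_block):
        self.symbols = []               # pending literals / (length, distance) codes
        self.emit_block = emit_block    # builds the Huffman trees and writes one block

    def push(self, symbol):
        self.symbols.append(symbol)
        if len(self.symbols) >= MAX_SYMBOLS_PER_BLOCK:
            self.flush()

    def flush(self):
        if self.symbols:
            self.emit_block(self.symbols)
            self.symbols = []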
So does anyone have any ideas for new strategies, or examples of existing ones?
Thanks in advance!
The deflation algorithm used by gzip (also zip and zlib) is a variation of LZ77 (Lempel-Ziv 1977, see reference below). It finds duplicated strings in the input data. The second occurrence of a string is replaced by a pointer to the previous string, in the form of a pair (distance,length).
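A tiny worked example of that (distance, length) form, using the classic string from the gzip documentation; the decoder below is only an illustration, not gzip's actual code:

def lz77_decode(tokens):
    # A (distance, length) pair copies `length` bytes starting `distance` bytes
    # back in the output produced so far; the copy may overlap the bytes it is
    # producing, which is how short repeated patterns expand.
    out = bytearray()
    for tok in tokens:
        if isinstance(tok, bytes):      # a run of literal bytes
            out += tok
        else:                           # a (distance, length) back-reference
            distance, length = tok
            for _ in range(length):
                out.append(out[-distance])
    return bytes(out)

# "Blah blah blah blah blah!" becomes six literals plus one overlapping match.
assert lz77_decode([b"Blah b", (5, 18), b"!"]) == b"Blah blah blah blah blah!"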
GZIP, short for GNU Zip, is a compression format and utility developed as part of a larger project to create a free software alternative to UNIX in the 1980s. This open source format does not support archiving, so it is used to compress single files. GZIP produces compressed files with the .gz extension.
As a suggestion to get you going: a speculative look-ahead with a buffer large enough that an indication of superior compression is worth the change.
This changes the streaming behaviour (more data must be read in before any output is produced) and significantly complicates operations like flush. It also adds a considerable amount of extra compression work.
In the general case it would be possible to guarantee optimal output simply by branching at each point where it is possible to start a new block, taking both branches and recursing as necessary until all routes have been explored. The path with the best behaviour wins. This is unlikely to be feasible on non-trivial input sizes, since the choice of when to start a new block is so open.
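For what it's worth, that exhaustive search looks roughly like this (Python; block_cost(i, j) is a hypothetical callback returning the encoded size of input[i:j] as a single block with its own trees):

from functools import lru_cache

def best_partition(n, block_cost, min_block=1):
    # Exhaustively choose block boundaries for an input of length n.
    @lru_cache(maxsize=None)
    def solve(i):
        if i == n:
            return 0, ()
        best = None
        # Branch on every point at which the current block could end.
        for j in range(i + min_block, n + 1):
            tail_cost, tail_cuts = solve(j)
            total = block_cost(i, j) + tail_cost
            if best is None or total < best[0]:
                best = (total, (j,) + tail_cuts)
        if best is None:                 # remaining data shorter than min_block
            return float("inf"), ()
        return best
    return solve(0)                      # (total cost, block end positions)

Even memoised this needs a quadratic number of cost evaluations, which is why restricting the candidate cut points, as below, matters.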
Simply restricting it to a minimum of 8K output literals, while preventing more than 32K literals in a block, would give a relatively tractable basis for trying speculative algorithms. Call 8K a sub block.
The simplest of these would be (pseudocode):
create empty sub block called definite
create empty sub block called specChange
create empty sub block called specKeep
target = definite

while (incomingData)
{
    compress data into target(s)
    if ((target.length % SUB_BLOCK_SIZE) == 0)
    {
        if (target is definite)
        {
            target becomes {specChange, specKeep}
                // specChange is built assuming a new block with fresh trees
                // specKeep is built assuming the same block as definite
        }
        else
        {
            if (compressed size of specChange + OVERHEAD < compressed size of specKeep)
            {
                flush definite as a block
                definite = specChange
                specKeep, specChange = empty
                // target remains {specChange, specKeep} as before,
                // but update the metadata associated with specChange to be fresh
            }
            else
            {
                definite += specKeep
                specKeep, specChange = empty
                // again update the block metadata
                if (definite.length >= MAX_BLOCK_SIZE)
                {
                    flush definite
                    target becomes definite
                }
            }
        }
    }
}
take the better of specChange/specKeep if non-empty and append it to definite
flush definite
OVERHEAD is some constant to account for the cost of switching blocks (the end-of-block code plus the new block's header and tree descriptions).
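If you want to experiment with that specKeep/specChange comparison before wiring it into your own encoder, zlib can serve as a crude stand-in for the two costs. This measures whole streams rather than individual blocks, and splitting the stream also resets the LZ77 window, so it overstates the cost of switching; the names and the OVERHEAD value are just placeholders:

import zlib

OVERHEAD = 16   # stands in for the switching cost; the separate stream below
                # already pays for its own header, so keep this small

def should_switch(definite: bytes, pending: bytes) -> bool:
    # specKeep: carry on with the current block's statistics -> one combined stream.
    keep_cost = len(zlib.compress(definite + pending, 9))
    # specChange: start fresh trees for the pending data -> two separate streams.
    change_cost = len(zlib.compress(definite, 9)) + len(zlib.compress(pending, 9))
    return change_cost + OVERHEAD < keep_cost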
The algorithm above is rough and could likely be improved, but it is a start for analysis if nothing else. Instrument the code to record what causes a switch, and use that to determine good heuristics for when a change might be beneficial (perhaps that the compression ratio has dropped significantly).
This could lead to specChange being built only when the heuristic considers it reasonable. If the heuristic turns out to be a strong indicator, you could then do away with the speculative nature entirely and simply decide to swap at that point no matter what.
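As one concrete example of such a heuristic, you could compare the output/input ratio of the most recent sub block against the running average for the block so far, and treat a sharp drop in compression as the trigger (all names and the threshold here are invented):

def should_start_new_block(block_out_bits, block_in_bytes,
                           sub_out_bits, sub_in_bytes, threshold=1.25):
    # A markedly worse ratio for the latest sub block suggests the symbol
    # statistics have shifted and fresh trees may pay for themselves.
    if block_in_bytes == 0 or sub_in_bytes == 0:
        return False
    block_ratio = block_out_bits / block_in_bytes
    sub_ratio = sub_out_bits / sub_in_bytes
    return sub_ratio > threshold * block_ratio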