 

What are some good strategies for determining block size in a deflate algorithm?

I'm writing a compression library as a little side project, and I'm far enough along (my library can extract any standard gzip file, as well as produce compliant (but certainly not yet optimal) gzip output) that it's time to figure out a meaningful block termination strategy. Currently, I just cut the blocks off after every 32k of input (the LZ77 window size) because it was convenient and quick to implement; now I am going back and trying to actually improve compression efficiency.

The Deflate spec has only this to say about it: "The compressor terminates a block when it determines that starting a new block with fresh trees would be useful, or when the block size fills up the compressor's block buffer", which isn't all that helpful.

I sorted through the SharpZipLib code (as I figured it would be the most easily readable open source implementation), and found that it terminates a block every 16k literals of output, ignoring the input. This is easy enough to implement, but it seems like there must be some more targeted approach, especially given the language in the spec "determines that starting a new block with fresh trees would be useful".
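
For concreteness, this is roughly what that fixed-interval strategy looks like sketched in Python. zlib doesn't expose deflate's per-block tree decisions, so Z_FULL_FLUSH (which also resets the LZ77 window) is a heavier-handed stand-in, and it counts input bytes rather than output literals; treat it as an illustration only:

import zlib

CHUNK = 16 * 1024  # fixed cutoff, loosely mirroring SharpZipLib's 16K literals

def compress_fixed_blocks(data):
    # Illustration only: Z_FULL_FLUSH ends the current block AND resets
    # the LZ77 window, which is stronger (and costlier) than deflate's
    # "end block, keep window" boundary; zlib offers no closer hook.
    comp = zlib.compressobj(9)
    out = []
    for i in range(0, len(data), CHUNK):
        out.append(comp.compress(data[i:i + CHUNK]))
        out.append(comp.flush(zlib.Z_FULL_FLUSH))  # force a block boundary
    out.append(comp.flush())  # Z_FINISH terminates the stream
    return b"".join(out)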

So does anyone have any ideas for new strategies, or examples of existing ones?

Thanks in advance!

asked Jan 27 '09 by David Hay



1 Answer

As a suggestion to get you going:

Use a speculative look-ahead: buffer enough input that you can tell whether the compression gain from starting a fresh block is worth the cost of the change.

This changes the streaming behaviour (more data must be read before any output is produced) and significantly complicates operations like flush. It also adds a considerable amount of extra compression work.

In the general case it would be possible to guarantee optimal output simply by branching at every point where it is possible to start a new block, taking both branches and recursing as necessary until every route has been explored, then keeping the path with the best result. This is not likely to be feasible on non-trivial input sizes, since the choice of when to start a new block is so open-ended.
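
Offline, though, with splits restricted to coarse boundaries, the exhaustive search becomes tractable and gives a baseline to measure any streaming heuristic against. A rough Python sketch of that idea, where compressing each segment independently with zlib stands in for "a block with fresh trees" (it also discards the LZ77 window between segments, so the costs are only approximate):

import zlib

SUB_BLOCK = 8 * 1024  # only consider splits at sub-block boundaries

def best_partition(data):
    # Dynamic programme over split points: best[i] = cheapest encoding of
    # data[i:] as independently compressed segments, each a stand-in for
    # one deflate block with fresh trees.
    n = len(data)
    bounds = list(range(0, n, SUB_BLOCK)) + [n]   # candidate boundaries
    best = {n: (0, ())}                           # offset -> (size, splits)
    for i in reversed(bounds[:-1]):
        cands = []
        for j in bounds:
            if j <= i:
                continue
            size = len(zlib.compress(data[i:j], 9))
            tail_size, tail_splits = best[j]
            cands.append((size + tail_size, (j,) + tail_splits))
        best[i] = min(cands)
    return best[0]  # (total compressed size, chosen split offsets)

best_partition is of course not a streaming strategy; it is only useful for judging how close a heuristic gets to optimal on a test corpus.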

Restricting block boundaries to a minimum of 8K output literals, while preventing more than 32K literals in a block, gives a relatively tractable basis for trying speculative algorithms. Call an 8K unit a sub-block.

The simplest such algorithm would be (in pseudocode):

create empty sub-block called definite
create empty sub-block called specChange
create empty sub-block called specKeep
target = definite
while (incomingData)
{
  compress data into target(s)
  if ((definite.length % SUB_BLOCK_SIZE) == 0)
  {
    if (target is definite)
    {
      target becomes both:
        specChange, assuming a new block
        specKeep, assuming the same block as definite
    }
    else
    {
      if (specChange.compressedLength + OVERHEAD < specKeep.compressedLength)
      {
        flush definite as a block
        definite = specChange
        specKeep, specChange = empty
        // target remains specKeep,specChange as before,
        // but update the metadata associated with specChange to be fresh
      }
      else
      {
        definite += specKeep
        specKeep, specChange = empty
        // again, update the block metadata
        if (definite.length == MAX_BLOCK_SIZE)
        {
          flush definite
          target becomes definite
        }
      }
    }
  }
}
take the best of specChange/specKeep if non-empty and append it to definite
flush definite

OVERHEAD is some constant that accounts for the cost of switching between blocks.
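
To make the specKeep/specChange comparison concrete, here is a rough Python sketch. zlib compression objects support copy(), which forks the live state for the "keep" branch; a brand-new compressor stands in for "a new block with fresh trees" (unlike a real block switch it also loses the LZ77 window, so it overstates the cost of changing). The OVERHEAD value and the sync-flush size measurement are assumptions to be tuned:

import zlib

OVERHEAD = 64  # assumed per-switch cost in bytes; tune from instrumentation

def fresh_block_wins(comp, sub_block):
    # "Keep" branch: fork the live compressor and continue as-is.
    keep = comp.copy()
    keep_size = len(keep.compress(sub_block) + keep.flush(zlib.Z_SYNC_FLUSH))

    # "Change" branch: a fresh compressor approximates a new block with
    # fresh trees (but, unlike a real block switch, also drops the window).
    change = zlib.compressobj(9)
    change_size = len(change.compress(sub_block) + change.flush(zlib.Z_SYNC_FLUSH))

    return change_size + OVERHEAD < keep_size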

This is rough and could likely be improved, but it is a start for analysis if nothing else. Instrument the code to record what causes a switch, and use that to find good heuristics for when a change might be beneficial (perhaps that the compression ratio has dropped significantly).

This could lead to building specChange only when the heuristic considers it reasonable. If the heuristic turns out to be a strong indicator, you could then do away with the speculation entirely and simply swap at that point no matter what.
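
One possible shape for such a heuristic, with the window and threshold as invented placeholders to be tuned from that instrumentation:

def ratio_dropped(ratios, window=4, factor=1.25):
    # ratios: compressed_size / raw_size for each completed sub-block.
    # Fire when the latest sub-block compresses noticeably worse than
    # the recent average, suggesting the data's statistics have shifted.
    if len(ratios) <= window:
        return False
    baseline = sum(ratios[-(window + 1):-1]) / window
    return ratios[-1] > baseline * factor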

answered Oct 15 '22 by ShuggyCoUk