I'm writing a compression library as a little side project, and I'm far enough along (my library can extract any standard gzip file and produce compliant, though certainly not yet optimal, gzip output) that it's time to figure out a meaningful block termination strategy. Currently, I just cut the blocks off after every 32k of input (the LZ77 window size) because it was convenient and quick to implement -- now I am going back and trying to actually improve compression efficiency.
The Deflate spec has only this to say about it: "The compressor terminates a block when it determines that starting a new block with fresh trees would be useful, or when the block size fills up the compressor's block buffer", which isn't all that helpful.
I sorted through the SharpZipLib code (as I figured it would be the most easily readable open source implementation), and found that it terminates a block every 16k literals of output, ignoring the input. This is easy enough to implement, but it seems like there must be some more targeted approach, especially given the language in the spec "determines that starting a new block with fresh trees would be useful".
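That fixed-count policy boils down to something like the following sketch (Python, with hypothetical names; emit_block stands in for whatever builds the Huffman trees and writes out one block):

MAX_SYMBOLS_PER_BLOCK = 16 * 1024

class BlockBuilder:
    def __init__(self, emit_block):
        self.symbols = []               # pending literals / (length, distance) codes
        self.emit_block = emit_block    # builds the Huffman trees and writes one block

    def push(self, symbol):
        self.symbols.append(symbol)
        if len(self.symbols) >= MAX_SYMBOLS_PER_BLOCK:
            self.flush()

    def flush(self):
        if self.symbols:
            self.emit_block(self.symbols)
            self.symbols = []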
So does anyone have any ideas for new strategies, or examples of existing ones?
Thanks in advance!
The deflation algorithm used by gzip (also zip and zlib) is a variation of LZ77 (Lempel-Ziv 1977, see reference below). It finds duplicated strings in the input data. The second occurrence of a string is replaced by a pointer to the previous string, in the form of a pair (distance,length).
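A tiny worked example of that (distance, length) form, using the classic string from the gzip documentation; the decoder below is only an illustration, not gzip's actual code:

def lz77_decode(tokens):
    # A (distance, length) pair copies `length` bytes starting `distance` bytes
    # back in the output produced so far; the copy may overlap the bytes it is
    # producing, which is how short repeated patterns expand.
    out = bytearray()
    for tok in tokens:
        if isinstance(tok, bytes):      # a run of literal bytes
            out += tok
        else:                           # a (distance, length) back-reference
            distance, length = tok
            for _ in range(length):
                out.append(out[-distance])
    return bytes(out)

# "Blah blah blah blah blah!" becomes six literals plus one overlapping match.
assert lz77_decode([b"Blah b", (5, 18), b"!"]) == b"Blah blah blah blah blah!"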
GZIP, short for GNU Zip, is a compression format and utility developed as part of a larger project to create a free software alternative to UNIX in the 1980s. This open source format does not support archiving, so it is used to compress single files. GZIP produces compressed files with the .gz extension.
As a suggestion to get you going: a speculative look-ahead with a buffer large enough that an indication of superior compression is worth the change.
This changes the streaming behaviour (more data must be read in before any output is produced) and significantly complicates operations like flush. It also adds a considerable amount of extra compression work.
In the general case it would be possible to guarantee optimal output simply by branching at each point where it is possible to start a new block, taking both branches and recursing as necessary until all routes have been explored. The path with the best behaviour wins. This is unlikely to be feasible on non-trivial input sizes, since the choice of when to start a new block is so open.
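For what it's worth, that exhaustive search looks roughly like this (Python; block_cost(i, j) is a hypothetical callback returning the encoded size of input[i:j] as a single block with its own trees):

from functools import lru_cache

def best_partition(n, block_cost, min_block=1):
    # Exhaustively choose block boundaries for an input of length n.
    @lru_cache(maxsize=None)
    def solve(i):
        if i == n:
            return 0, ()
        best = None
        # Branch on every point at which the current block could end.
        for j in range(i + min_block, n + 1):
            tail_cost, tail_cuts = solve(j)
            total = block_cost(i, j) + tail_cost
            if best is None or total < best[0]:
                best = (total, (j,) + tail_cuts)
        if best is None:                 # remaining data shorter than min_block
            return float("inf"), ()
        return best
    return solve(0)                      # (total cost, block end positions)

Even memoised this needs a quadratic number of cost evaluations, which is why restricting the candidate cut points, as below, matters.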
Simply restricting it to a minimum of 8K output literals, while preventing more than 32K literals in a block, would give a relatively tractable basis for trying speculative algorithms. Call 8K a sub block.
The simplest of these would be (pseudocode):
create empty sub block called definite
create empty sub block called specChange
create empty sub block called specKeep
target = definite

while (incomingData)
{
    compress data into target(s)
    if ((target.length % SUB_BLOCK_SIZE) == 0)
    {
        if (target is definite)
        {
            target becomes {specChange, specKeep}
                // specChange is built assuming a new block with fresh trees
                // specKeep is built assuming the same block as definite
        }
        else
        {
            if (compressed size of specChange + OVERHEAD < compressed size of specKeep)
            {
                flush definite as a block
                definite = specChange
                specKeep, specChange = empty
                // target remains {specChange, specKeep} as before,
                // but update the metadata associated with specChange to be fresh
            }
            else
            {
                definite += specKeep
                specKeep, specChange = empty
                // again update the block metadata
                if (definite.length >= MAX_BLOCK_SIZE)
                {
                    flush definite
                    target becomes definite
                }
            }
        }
    }
}
take the better of specChange/specKeep if non-empty and append it to definite
flush definite
OVERHEAD is some constant to account for the cost of switching blocks (the end-of-block code plus the new block's header and tree descriptions).
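If you want to experiment with that specKeep/specChange comparison before wiring it into your own encoder, zlib can serve as a crude stand-in for the two costs. This measures whole streams rather than individual blocks, and splitting the stream also resets the LZ77 window, so it overstates the cost of switching; the names and the OVERHEAD value are just placeholders:

import zlib

OVERHEAD = 16   # stands in for the switching cost; the separate stream below
                # already pays for its own header, so keep this small

def should_switch(definite: bytes, pending: bytes) -> bool:
    # specKeep: carry on with the current block's statistics -> one combined stream.
    keep_cost = len(zlib.compress(definite + pending, 9))
    # specChange: start fresh trees for the pending data -> two separate streams.
    change_cost = len(zlib.compress(definite, 9)) + len(zlib.compress(pending, 9))
    return change_cost + OVERHEAD < keep_cost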
The algorithm above is rough and could likely be improved, but it is a start for analysis if nothing else. Instrument the code to record what causes a switch, and use that to determine good heuristics for when a change might be beneficial (perhaps that the compression ratio has dropped significantly).
This could lead to specChange being built only when the heuristic considers it reasonable. If the heuristic turns out to be a strong indicator, you could then do away with the speculative nature entirely and simply decide to swap at that point no matter what.
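As one concrete example of such a heuristic, you could compare the output/input ratio of the most recent sub block against the running average for the block so far, and treat a sharp drop in compression as the trigger (all names and the threshold here are invented):

def should_start_new_block(block_out_bits, block_in_bytes,
                           sub_out_bits, sub_in_bytes, threshold=1.25):
    # A markedly worse ratio for the latest sub block suggests the symbol
    # statistics have shifted and fresh trees may pay for themselves.
    if block_in_bytes == 0 or sub_in_bytes == 0:
        return False
    block_ratio = block_out_bits / block_in_bytes
    sub_ratio = sub_out_bits / sub_in_bytes
    return sub_ratio > threshold * block_ratio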