
zlib, deflate: How much memory to allocate?

I am using zlib to compress a stream of text data. The text data comes in chunks, and for each chunk, deflate() is called, with flush set to Z_NO_FLUSH. Once all chunks have been retrieved, deflate() is called with flush set to Z_FINISH.

Naturally, deflate() doesn't produce compressed output on each call. It accumulates data internally to achieve a high compression ratio, and that's fine. Every time deflate() produces compressed output, that output is appended to a database field, which is a slow process.

However, once deflate() produces compressed data, that data may not fit into the provided output buffer, deflate_out. Therefore several calls to deflate() are required. And that is what I want to avoid:

Is there a way to make deflate_out always large enough so that deflate() can store all the compressed data in it, every time it decides to produce output?

Notes:

  • The total size of the uncompressed data is not known beforehand. As mentioned above, the uncompressed data comes in chunks, and the compressed data is appended to a database field, also in chunks.

  • In the include file zconf.h I have found the following comment. Is that perhaps what I am looking for? I.e. is (1 << (windowBits+2)) + (1 << (memLevel+9)) the maximum size in bytes of compressed data that deflate() may produce?

    /* The memory requirements for deflate are (in bytes):
                (1 << (windowBits+2)) +  (1 << (memLevel+9))
     that is: 128K for windowBits=15  +  128K for memLevel = 8  (default values)
     plus a few kilobytes for small objects. For example, if you want to reduce
     the default memory requirements from 256K to 128K, compile with
         make CFLAGS="-O -DMAX_WBITS=14 -DMAX_MEM_LEVEL=7"
     Of course this will generally degrade compression (there's no free lunch).
    
       The memory requirements for inflate are (in bytes) 1 << windowBits
     that is, 32K for windowBits=15 (default value) plus a few kilobytes
     for small objects.
    */
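
The arithmetic in the quoted comment can be checked with a short sketch. The function names below are made up for illustration; the formulas are copied from the comment, which describes the working memory of deflate and inflate. The asserted ranges are the ones zlib.h documents for deflateInit2().

```c
#include <assert.h>
#include <stddef.h>

/* Working memory for deflate, per the zconf.h comment:
 * (1 << (windowBits+2)) + (1 << (memLevel+9)) bytes. */
static size_t deflate_state_bytes(int window_bits, int mem_level)
{
    assert(window_bits >= 8 && window_bits <= 15);
    assert(mem_level >= 1 && mem_level <= 9);
    return ((size_t)1 << (window_bits + 2)) + ((size_t)1 << (mem_level + 9));
}

/* Working memory for inflate, per the same comment: 1 << windowBits bytes. */
static size_t inflate_state_bytes(int window_bits)
{
    assert(window_bits >= 8 && window_bits <= 15);
    return (size_t)1 << window_bits;
}
```

With the defaults, deflate_state_bytes(15, 8) is 131072 + 131072 = 262144 bytes (the 256K in the comment), and inflate_state_bytes(15) is 32768 bytes (32K).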
    
feklee asked Jan 17 '12 23:01


2 Answers

deflateBound() is helpful only if you do all of the compression in a single step, or if you force deflate to compress all of the input data currently available to it and emit compressed data for all of that input. You would do that with a flush parameter such as Z_BLOCK, Z_PARTIAL_FLUSH, etc.

If you want to use Z_NO_FLUSH, then it becomes far more difficult as well as inefficient to attempt to predict the largest amount of output deflate() might emit on the next call. You don't know how much of the input was consumed at the time the last burst of compressed data was emitted, so you need to assume almost none of it, with the buffer size growing unnecessarily. However you attempt to estimate the maximum output, you will be doing a lot of unnecessary mallocs or reallocs for no good reason, which is inefficient.

There is no point in avoiding further calls to deflate() for more output. If you simply loop on deflate() until it has no more output for you, then you can use a fixed output buffer, malloced once. That is how the deflate() and inflate() interface was designed to be used. You can look at http://zlib.net/zlib_how.html for a well-documented example of how to use the interface.

By the way, there is a deflatePending() function in the latest version of zlib (1.2.6) that lets you know how much output deflate() has waiting to deliver.

Mark Adler answered Sep 24 '22 02:09


While looking through the sources for a hint, I came across

    /* =========================================================================
     * Flush as much pending output as possible. All deflate() output goes
     * through this function so some applications may wish to modify it
     * to avoid allocating a large strm->next_out buffer and copying into it.
     * (See also read_buf()).
     */
    local void flush_pending(strm)
        z_streamp strm;
    {
        unsigned len = strm->state->pending;
    ...

Tracing the use of flush_pending() throughout deflate() shows that an upper bound on the output buffer needed in the middle of the stream is

    strm->state->pending + deflateBound(strm, strm->avail_in)

The first part accounts for data still in the pipe from previous calls to deflate(); the second part accounts for the not-yet-processed data of length avail_in.

Eugen Rieck answered Sep 25 '22 02:09