Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python: Inflate and Deflate implementations

I am interfacing with a server that requires that data sent to it is compressed with Deflate algorithm (Huffman encoding + LZ77) and also sends data that I need to Inflate.

I know that Python includes Zlib, and that the C libraries in Zlib support calls to Inflate and Deflate, but these apparently are not provided by the Python Zlib module. It does provide Compress and Decompress, but when I make a call such as the following:

result_data = zlib.decompress( base64_decoded_compressed_string )

I receive the following error:

Error -3 while decompressing data: incorrect header check

Gzip does no better; when making a call such as:

result_data = gzip.GzipFile( fileobj = StringIO.StringIO( base64_decoded_compressed_string ) ).read()

I receive the error:

IOError: Not a gzipped file

which makes sense as the data is a Deflated file not a true Gzipped file.

Now I know that there is a Deflate implementation available (Pyflate), but I do not know of an Inflate implementation.

It seems that there are a few options:

  1. Find an existing implementation (ideal) of Inflate and Deflate in Python
  2. Write my own Python extension to the zlib c library that includes Inflate and Deflate
  3. Call something else that can be executed from the command line (such as a Ruby script, since Inflate/Deflate calls in zlib are fully wrapped in Ruby)
  4. ?

I am seeking a solution, but lacking a solution I will be thankful for insights, constructive opinions, and ideas.

Additional information: The result of deflating (and encoding) a string should, for the purposes I need, give the same result as the following snippet of C# code, where the input parameter is an array of UTF bytes corresponding to the data to compress:

public static string DeflateAndEncodeBase64(byte[] data)
{
    if (null == data || data.Length < 1) return null;
    string compressedBase64 = "";

    //write into a new memory stream wrapped by a deflate stream
    using (MemoryStream ms = new MemoryStream())
    {
        using (DeflateStream deflateStream = new DeflateStream(ms, CompressionMode.Compress, true))
        {
            //write byte buffer into memorystream
            deflateStream.Write(data, 0, data.Length);
            deflateStream.Close();

            //rewind memory stream and write to base 64 string
            byte[] compressedBytes = new byte[ms.Length];
            ms.Seek(0, SeekOrigin.Begin);
            ms.Read(compressedBytes, 0, (int)ms.Length);
            compressedBase64 = Convert.ToBase64String(compressedBytes);
        }
    }
    return compressedBase64;
}

Running this .NET code for the string "deflate and encode me" gives the result

7b0HYBxJliUmL23Ke39K9UrX4HShCIBgEyTYkEAQ7MGIzeaS7B1pRyMpqyqBymVWZV1mFkDM7Z28995777333nvvvfe6O51OJ/ff/z9cZmQBbPbOStrJniGAqsgfP358Hz8iZvl5mbV5mi1nab6cVrM8XeT/Dw==

When "deflate and encode me" is run through the Python Zlib.compress() and then base64 encoded, the result is "eJxLSU3LSSxJVUjMS1FIzUvOT0lVyE0FAFXHB6k=".

It is clear that zlib.compress() is not an implementation of the same algorithm as the standard Deflate algorithm.

More Information:

The first 2 bytes of the .NET deflate data ("7b0HY..."), after b64 decoding are 0xEDBD, which does not correspond to Gzip data (0x1f8b), BZip2 (0x425A) data, or Zlib (0x789C) data.

The first 2 bytes of the Python compressed data ("eJxLS..."), after b64 decoding are 0x789C. This is a Zlib header.

SOLVED

To handle the raw deflate and inflate, without header and checksum, the following things needed to happen:

On deflate/compress: strip the first two bytes (header) and the last four bytes (checksum).

On inflate/decompress: there is a second argument for window size. If this value is negative it suppresses headers. here are my methods currently, including the base64 encoding/decoding - and working properly:

import zlib
import base64

def decode_base64_and_inflate( b64string ):
    decoded_data = base64.b64decode( b64string )
    return zlib.decompress( decoded_data , -15)

def deflate_and_base64_encode( string_val ):
    zlibbed_str = zlib.compress( string_val )
    compressed_string = zlibbed_str[2:-4]
    return base64.b64encode( compressed_string )
like image 551
Demi Avatar asked Jul 06 '09 23:07

Demi


2 Answers

You can still use the zlib module to inflate/deflate data. The gzip module uses it internally, but adds a file-header to make it into a gzip-file. Looking at the gzip.py file, something like this could work:

import zlib

def deflate(data, compresslevel=9):
    compress = zlib.compressobj(
            compresslevel,        # level: 0-9
            zlib.DEFLATED,        # method: must be DEFLATED
            -zlib.MAX_WBITS,      # window size in bits:
                                  #   -15..-8: negate, suppress header
                                  #   8..15: normal
                                  #   16..30: subtract 16, gzip header
            zlib.DEF_MEM_LEVEL,   # mem level: 1..8/9
            0                     # strategy:
                                  #   0 = Z_DEFAULT_STRATEGY
                                  #   1 = Z_FILTERED
                                  #   2 = Z_HUFFMAN_ONLY
                                  #   3 = Z_RLE
                                  #   4 = Z_FIXED
    )
    deflated = compress.compress(data)
    deflated += compress.flush()
    return deflated

def inflate(data):
    decompress = zlib.decompressobj(
            -zlib.MAX_WBITS  # see above
    )
    inflated = decompress.decompress(data)
    inflated += decompress.flush()
    return inflated

I don't know if this corresponds exactly to whatever your server requires, but those two functions are able to round-trip any data I tried.

The parameters maps directly to what is passed to the zlib library functions.

PythonC
zlib.compressobj(...)deflateInit(...)
compressobj.compress(...)deflate(...)
zlib.decompressobj(...)inflateInit(...)
decompressobj.decompress(...)inflate(...)

The constructors create the structure and populate it with default values, and pass it along to the init-functions. The compress/decompress methods update the structure and pass it to inflate/deflate.

like image 153
Markus Jarderot Avatar answered Oct 12 '22 07:10

Markus Jarderot


This is an add-on to MizardX's answer, giving some explanation and background.

See http://www.chiramattel.com/george/blog/2007/09/09/deflatestream-block-length-does-not-match.html

According to RFC 1950, a zlib stream constructed in the default manner is composed of:

  • a 2-byte header (e.g. 0x78 0x9C)
  • a deflate stream -- see RFC 1951
  • an Adler-32 checksum of the uncompressed data (4 bytes)

The C# DeflateStream works on (you guessed it) a deflate stream. MizardX's code is telling the zlib module that the data is a raw deflate stream.

Observations: (1) One hopes the C# "deflation" method producing a longer string happens only with short input (2) Using the raw deflate stream without the Adler-32 checksum? Bit risky, unless replaced with something better.

Updates

error message Block length does not match with its complement

If you are trying to inflate some compressed data with the C# DeflateStream and you get that message, then it is quite possible that you are giving it a a zlib stream, not a deflate stream.

See How do you use a DeflateStream on part of a file?

Also copy/paste the error message into a Google search and you will get numerous hits (including the one up the front of this answer) saying much the same thing.

The Java Deflater ... used by "the website" ... C# DeflateStream "is pretty straightforward and has been tested against the Java implementation". Which of the following possible Java Deflater constructors is the website using?

public Deflater(int level, boolean nowrap)

Creates a new compressor using the specified compression level. If 'nowrap' is true then the ZLIB header and checksum fields will not be used in order to support the compression format used in both GZIP and PKZIP.

public Deflater(int level)

Creates a new compressor using the specified compression level. Compressed data will be generated in ZLIB format.

public Deflater()

Creates a new compressor with the default compression level. Compressed data will be generated in ZLIB format.

A one-line deflater after throwing away the 2-byte zlib header and the 4-byte checksum:

uncompressed_string.encode('zlib')[2:-4] # does not work in Python 3.x

or

zlib.compress(uncompressed_string)[2:-4]
like image 25
John Machin Avatar answered Oct 12 '22 09:10

John Machin