Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How can I work with Gzip files which contain extra data?

Tags:

python

gzip

I'm writing a script which will work with data coming from instrumentation as gzip streams. In about 90% of cases, the gzip module works perfectly, but some of the streams cause it to produce IOError: Not a gzipped file. If the gzip header is removed and the deflate stream fed directly to zlib, I instead get Error -3 while decompressing data: incorrect header check. After about half a day of banging my head against the wall, I discovered that the streams which are having problems contain a seemingly-random number of extra bytes (which are not part of the gzip data) appended to the end.

It strikes me as odd that Python cannot work with these files for two reasons:

  1. Both Gzip and 7zip are able to open these "padded" files without issue. (Gzip produces the message decompression OK, trailing garbage ignored, 7zip succeeds silently.)
  2. Both the Gzip and Python docs seem to indicate that this should work: (emphasis mine)

    Gzip's format.txt:

    It must be possible to detect the end of the compressed data with any compression method, regardless of the actual size of the compressed data. In particular, the decompressor must be able to detect and skip extra data appended to a valid compressed file on a record-oriented file system, or when the compressed data can only be read from a device in multiples of a certain block size.

    Python's gzip.GzipFile`:

    Calling a GzipFile object’s close() method does not close fileobj, since you might wish to append more material after the compressed data. This also allows you to pass a StringIO object opened for writing as fileobj, and retrieve the resulting memory buffer using the StringIO object’s getvalue() method.

    Python's zlib.Decompress.unused_data:

    A string which contains any bytes past the end of the compressed data. That is, this remains "" until the last byte that contains compression data is available. If the whole string turned out to contain compressed data, this is "", the empty string.

    The only way to determine where a string of compressed data ends is by actually decompressing it. This means that when compressed data is contained part of a larger file, you can only find the end of it by reading data and feeding it followed by some non-empty string into a decompression object’s decompress() method until the unused_data attribute is no longer the empty string.

Here are the four approaches I've tried. (These examples are Python 3.1, but I've tested 2.5 and 2.7 and had the same problem.)

# approach 1 - gzip.open
with gzip.open(filename) as datafile:
    data = datafile.read()

# approach 2 - gzip.GzipFile
with open(filename, "rb") as gzipfile:
    with gzip.GzipFile(fileobj=gzipfile) as datafile:
        data = datafile.read()

# approach 3 - zlib.decompress
with open(filename, "rb") as gzipfile:
    data = zlib.decompress(gzipfile.read()[10:])

# approach 4 - zlib.decompressobj
with open(filename, "rb") as gzipfile:
    decompressor = zlib.decompressobj()
    data = decompressor.decompress(gzipfile.read()[10:])

Am I doing something wrong?

UPDATE

Okay, while the problem with gzip seems to be a bug in the module, my zlib problems are self-inflicted. ;-)

While digging into gzip.py I realized what I was doing wrong — by default, zlib.decompress et al. expect zlib-wrapped streams, not bare deflate streams. By passing in a negative value for wbits, you can tell zlib to skip the zlib header and decrompress the raw stream. Both of these work:

# approach 5 - zlib.decompress with negative wbits
with open(filename, "rb") as gzipfile:
    data = zlib.decompress(gzipfile.read()[10:], -zlib.MAX_WBITS)

# approach 6 - zlib.decompressobj with negative wbits
with open(filename, "rb") as gzipfile:
    decompressor = zlib.decompressobj(-zlib.MAX_WBITS)
    data = decompressor.decompress(gzipfile.read()[10:])
like image 879
Ben Blank Avatar asked Feb 08 '11 01:02

Ben Blank


2 Answers

This is a bug. The quality of the gzip module in Python falls far short of the quality that should be required in the Python standard library.

The problem here is that the gzip module assumes that the file is a stream of gzip-format files. At the end of the compressed data, it starts from scratch, expecting a new gzip header; if it doesn't find one, it raises an exception. This is wrong.

Of course, it is valid to concatenate two gzip files, eg:

echo testing > test.txt
gzip test.txt
cat test.txt.gz test.txt.gz > test2.txt.gz
zcat test2.txt.gz
# testing
# testing

The gzip module's error is that it should not raise an exception if there's no gzip header the second time around; it should simply end the file. It should only raise an exception if there's no header the first time.

There's no clean workaround without modifying the gzip module directly; if you want to do that, look at the bottom of the _read method. It should set another flag, eg. reading_second_block, to tell _read_gzip_header to raise EOFError instead of IOError.

There are other bugs in this module. For example, it seeks unnecessarily, causing it to fail on nonseekable streams, such as network sockets. This gives me very little confidence in this module: a developer who doesn't know that gzip needs to function without seeking is badly unqualified to implement it for the Python standard library.

like image 183
Glenn Maynard Avatar answered Nov 16 '22 02:11

Glenn Maynard


I had a similar problem in the past. I wrote a new module that works better with streams. You can try that out and see if it works for you.

like image 33
Keith Avatar answered Nov 16 '22 02:11

Keith