I'm writing a script which will work with data coming from instrumentation as gzip streams. In about 90% of cases, the gzip module works perfectly, but some of the streams cause it to produce IOError: Not a gzipped file. If the gzip header is removed and the deflate stream fed directly to zlib, I instead get Error -3 while decompressing data: incorrect header check. After about half a day of banging my head against the wall, I discovered that the streams which are having problems contain a seemingly-random number of extra bytes (which are not part of the gzip data) appended to the end.
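For anyone who wants to reproduce this without instrument data, here is a minimal sketch that builds a padded stream in memory and feeds it to the gzip module. The helper names make_padded_gzip and try_gzip_module are made up for illustration, and gzip.compress needs Python 3.2 or later. On the versions mentioned in this question the padded case is rejected; other versions may behave differently, so the sketch reports the outcome rather than assuming it.

```python
import gzip
import io


def make_padded_gzip(payload, padding):
    """Return a valid gzip stream for payload with extra bytes appended."""
    return gzip.compress(payload) + padding


def try_gzip_module(blob):
    """Return (data, None) on success, or (None, exc) if gzip rejects blob."""
    try:
        with gzip.GzipFile(fileobj=io.BytesIO(blob)) as datafile:
            return datafile.read(), None
    except (OSError, EOFError) as exc:
        # IOError is an alias of OSError on Python 3.3+.
        return None, exc


payload = b"instrument reading\n" * 3
clean_result = try_gzip_module(make_padded_gzip(payload, b""))
padded_result = try_gzip_module(make_padded_gzip(payload, b"\x07\x42junk"))
print("clean: ", clean_result)
print("padded:", padded_result)
```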
It strikes me as odd that Python cannot work with these files, for two reasons:
1. Other tools handle them: gzip reports "decompression OK, trailing garbage ignored", and 7zip succeeds silently.
2. Both the Gzip and Python docs seem to indicate that this should work (emphasis mine):
Gzip's format.txt:
It must be possible to detect the end of the compressed data with any compression method, regardless of the actual size of the compressed data. In particular, the decompressor must be able to detect and skip extra data appended to a valid compressed file on a record-oriented file system, or when the compressed data can only be read from a device in multiples of a certain block size.
Python's gzip.GzipFile:
Calling a GzipFile object’s close() method does not close fileobj, since you might wish to append more material after the compressed data. This also allows you to pass a StringIO object opened for writing as fileobj, and retrieve the resulting memory buffer using the StringIO object’s getvalue() method.
Python's zlib.Decompress.unused_data:
A string which contains any bytes past the end of the compressed data. That is, this remains "" until the last byte that contains compression data is available. If the whole string turned out to contain compressed data, this is "", the empty string.
The only way to determine where a string of compressed data ends is by actually decompressing it. This means that when compressed data is contained part of a larger file, you can only find the end of it by reading data and feeding it followed by some non-empty string into a decompression object’s decompress() method until the unused_data attribute is no longer the empty string.
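As a small illustration of the unused_data behaviour quoted above, the following sketch compresses some bytes in zlib format, appends padding, and splits the two apart again. The function name split_compressed and the sample bytes are invented for the example; it is written for Python 3, where these values are bytes rather than strings.

```python
import zlib


def split_compressed(blob):
    """Decompress the zlib stream at the start of blob.

    Returns (data, leftover), where leftover is whatever followed the
    end of the compressed stream.
    """
    decompressor = zlib.decompressobj()
    data = decompressor.decompress(blob)
    data += decompressor.flush()
    return data, decompressor.unused_data


payload = b"instrument reading " * 4
blob = zlib.compress(payload) + b"\x00\xffPAD"
data, leftover = split_compressed(blob)
print(data == payload, leftover)
```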
Here are the four approaches I've tried. (These examples are Python 3.1, but I've tested 2.5 and 2.7 and had the same problem.)
import gzip
import zlib

# approach 1 - gzip.open
with gzip.open(filename) as datafile:
    data = datafile.read()
# approach 2 - gzip.GzipFile
with open(filename, "rb") as gzipfile:
    with gzip.GzipFile(fileobj=gzipfile) as datafile:
        data = datafile.read()
# approach 3 - zlib.decompress (skips a 10-byte gzip header)
with open(filename, "rb") as gzipfile:
    data = zlib.decompress(gzipfile.read()[10:])
# approach 4 - zlib.decompressobj (skips a 10-byte gzip header)
with open(filename, "rb") as gzipfile:
    decompressor = zlib.decompressobj()
    data = decompressor.decompress(gzipfile.read()[10:])
Am I doing something wrong?
UPDATE
Okay, while the problem with gzip seems to be a bug in the module, my zlib problems are self-inflicted. ;-)
While digging into gzip.py I realized what I was doing wrong: by default, zlib.decompress et al. expect zlib-wrapped streams, not bare deflate streams. By passing in a negative value for wbits, you can tell zlib to skip the zlib header and decompress the raw stream. Both of these work:
# approach 5 - zlib.decompress with negative wbits
with open(filename, "rb") as gzipfile:
    data = zlib.decompress(gzipfile.read()[10:], -zlib.MAX_WBITS)
# approach 6 - zlib.decompressobj with negative wbits
with open(filename, "rb") as gzipfile:
    decompressor = zlib.decompressobj(-zlib.MAX_WBITS)
    data = decompressor.decompress(gzipfile.read()[10:])
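One caveat with the [10:] slice is that it assumes a bare 10-byte gzip header; a header carrying optional fields (such as a stored filename) is longer. A possible variant, sketched below, passes 16 + zlib.MAX_WBITS so that zlib parses the gzip header and trailer itself, and then reads the appended bytes from unused_data. The function name gunzip_ignoring_trailer is hypothetical, and the eof attribute used for the truncation check assumes Python 3.3 or later.

```python
import gzip
import zlib


def gunzip_ignoring_trailer(blob):
    """Decompress one gzip member and return (data, trailing_bytes)."""
    # Adding 16 to wbits asks zlib to expect a gzip header and trailer
    # instead of a zlib header.
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    data = decompressor.decompress(blob) + decompressor.flush()
    if not decompressor.eof:
        raise ValueError("gzip stream is truncated")
    # Anything after the gzip trailer ends up here.
    return data, decompressor.unused_data


payload = b"sample measurement data\n" * 10
data, trailing = gunzip_ignoring_trailer(gzip.compress(payload) + b"garbage")
print(data == payload, trailing)
```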
This is a bug. The quality of the gzip module in Python falls far short of the quality that should be required in the Python standard library.
The problem here is that the gzip module assumes that the file is a stream of gzip-format files. At the end of the compressed data, it starts from scratch, expecting a new gzip header; if it doesn't find one, it raises an exception. This is wrong.
Of course, it is valid to concatenate two gzip files, e.g.:
echo testing > test.txt
gzip test.txt
cat test.txt.gz test.txt.gz > test2.txt.gz
zcat test2.txt.gz
# testing
# testing
The gzip module's error is that it should not raise an exception if there's no gzip header the second time around; it should simply end the file. It should only raise an exception if there's no header the first time.
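The behaviour described above can be sketched outside the gzip module by driving zlib directly. This is only an illustration, not the module's actual code: read_gzip_members and GZIP_MAGIC are names invented here, the whole input is held in memory, and the eof attribute assumes Python 3.3 or later. Note that trailing bytes which happen to begin with the gzip magic number are treated as another member and will raise zlib.error if they are not valid.

```python
import gzip
import zlib

GZIP_MAGIC = b"\x1f\x8b"


def read_gzip_members(blob):
    """Decompress concatenated gzip members; return (data, trailing_bytes).

    A missing header is an error only for the first member. After that,
    anything that does not start with the gzip magic ends the file.
    """
    if not blob.startswith(GZIP_MAGIC):
        raise OSError("Not a gzipped file")
    parts = []
    rest = blob
    while rest.startswith(GZIP_MAGIC):
        member = zlib.decompressobj(16 + zlib.MAX_WBITS)
        parts.append(member.decompress(rest) + member.flush())
        if not member.eof:
            raise EOFError("compressed data ended before the end of the member")
        rest = member.unused_data
    return b"".join(parts), rest


one = gzip.compress(b"testing\n")
data, trailing = read_gzip_members(one + one + b"junk")
print(data, trailing)
```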
There's no clean workaround without modifying the gzip module directly; if you want to do that, look at the bottom of the _read method. It should set another flag, e.g. reading_second_block, to tell _read_gzip_header to raise EOFError instead of IOError.
There are other bugs in this module. For example, it seeks unnecessarily, causing it to fail on nonseekable streams, such as network sockets. This gives me very little confidence in this module: a developer who doesn't know that gzip needs to function without seeking is badly unqualified to implement it for the Python standard library.
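To show that seeking is not required, here is a rough sketch of a generator that decompresses a single gzip member from any object with a read() method, such as a socket's file wrapper. The name iter_gunzip and the chunk size are arbitrary choices for the example, it handles one member only, and it relies on the eof attribute from Python 3.3 or later. Bytes after the end of the member are simply left unread or ignored.

```python
import gzip
import io
import zlib


def iter_gunzip(stream, chunk_size=64 * 1024):
    """Yield decompressed chunks from stream using only stream.read()."""
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    while not decompressor.eof:
        chunk = stream.read(chunk_size)
        if not chunk:
            raise EOFError("stream ended before the gzip member was complete")
        data = decompressor.decompress(chunk)
        if data:
            yield data
    tail = decompressor.flush()
    if tail:
        yield tail


payload = b"streamed sample\n" * 100
source = io.BytesIO(gzip.compress(payload) + b"junk")
result = b"".join(iter_gunzip(source, chunk_size=7))
print(result == payload)
```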
I had a similar problem in the past. I wrote a new module that works better with streams. You can try that out and see if it works for you.