Python decompressing gzip chunk-by-chunk

I have a memory- and disk-limited environment where I need to decompress the contents of a gzip file sent to me in string-based chunks (over xmlrpc binary transfer). However, both zlib.decompress() and zlib.decompressobj()/decompress() barf on the gzip header. I've tried offsetting past the gzip header (documented here), but still haven't managed to avoid the barf. The gzip library itself only seems to support decompressing from files.

The following snippet gives a simplified illustration of what I would like to do (except in real life the buffer will be filled from xmlrpc, rather than reading from a local file):

    #! /usr/bin/env python

    import zlib

    CHUNKSIZE=1000

    d = zlib.decompressobj()

    f=open('23046-8.txt.gz','rb')
    buffer=f.read(CHUNKSIZE)

    while buffer:
      outstr = d.decompress(buffer)
      print(outstr)
      buffer=f.read(CHUNKSIZE)

    outstr = d.flush()
    print(outstr)

    f.close()

Unfortunately, as I said, this barfs with:

    Traceback (most recent call last):
      File "./test.py", line 13, in <module>
        outstr = d.decompress(buffer)
    zlib.error: Error -3 while decompressing: incorrect header check

Theoretically, I could feed my xmlrpc-sourced data into a StringIO and then use that as a fileobj for gzip.GzipFile(). However, in real life, I don't have enough memory to hold the entire compressed file as well as the decompressed data. I really do need to process it chunk-by-chunk.

The fall-back would be to change the compression of my xmlrpc-sourced data from gzip to plain zlib, but since that impacts other sub-systems I'd prefer to avoid it if possible.

Any ideas?


asked Mar 11 '10 09:03

user291294

2 Answers

gzip and zlib use slightly different headers.

See How can I decompress a gzip stream with zlib?

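Try d = zlib.decompressobj(16+zlib.MAX_WBITS). Adding 16 to the window-bits argument tells zlib to expect a gzip header and trailer instead of a zlib one.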

And you might try changing your chunk size to a power of 2 (say CHUNKSIZE=1024) for possible performance reasons.
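To make the suggestion concrete, here is a minimal sketch of the chunk-by-chunk loop with the gzip-aware decompressor. The helper name gunzip_chunks and the in-memory sample data are assumptions for illustration only; in the real setup the chunks would come from the xmlrpc transfer, and each decompressed piece would be processed as it arrives rather than joined.

```python
import gzip
import zlib

CHUNKSIZE = 1024


def gunzip_chunks(chunks):
    """Yield decompressed bytes for an iterable of gzip-compressed chunks.

    Hypothetical helper, not part of the original question or answer.
    """
    # 16 + MAX_WBITS makes zlib expect a gzip header and trailer.
    d = zlib.decompressobj(16 + zlib.MAX_WBITS)
    for chunk in chunks:
        out = d.decompress(chunk)
        if out:
            yield out
    # Emit whatever is still buffered once the input is exhausted.
    tail = d.flush()
    if tail:
        yield tail


# Assumed stand-in for the xmlrpc transfer: gzip some sample data in memory
# and slice the compressed bytes into fixed-size chunks.
original = "\n".join(str(i) for i in range(20000)).encode("ascii")
compressed = gzip.compress(original)
chunks = [compressed[i:i + CHUNKSIZE] for i in range(0, len(compressed), CHUNKSIZE)]

# Joined here only to check the round trip; real code would handle each
# piece as it is yielded to keep memory use low.
restored = b"".join(gunzip_chunks(chunks))
print(restored == original)
```

Only one compressed chunk is held at a time, so the whole compressed file never needs to be in memory.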

answered Oct 05 '22 07:10

wisty

I've got a more detailed answer here: https://stackoverflow.com/a/22310760/1733117

d = zlib.decompressobj(zlib.MAX_WBITS|32)

Per the documentation, this automatically detects the header (zlib or gzip).
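As a rough sketch of how the auto-detecting variant could be used, the loop below feeds chunks to a single decompressor and accepts either a gzip or a zlib stream. The helper name decompress_auto, the max_out parameter, and the sample data are assumptions for illustration. It also passes a max_length to decompress() and drains unconsumed_tail, which limits how much output each decompress() call returns; that may help in a memory-limited environment.

```python
import gzip
import zlib


def decompress_auto(chunks, max_out=4096):
    """Yield decompressed pieces from gzip- or zlib-compressed chunks.

    Hypothetical helper; max_out caps the output of each decompress() call.
    """
    # MAX_WBITS | 32 enables automatic zlib/gzip header detection.
    d = zlib.decompressobj(zlib.MAX_WBITS | 32)
    for chunk in chunks:
        data = chunk
        while data:
            out = d.decompress(data, max_out)
            if out:
                yield out
            # Input that was not consumed because the output cap was hit.
            data = d.unconsumed_tail
    # Emit anything still buffered inside the decompressor.
    tail = d.flush()
    if tail:
        yield tail


def split(blob, size=64):
    """Slice a bytes object into fixed-size chunks (stand-in for xmlrpc)."""
    return [blob[i:i + size] for i in range(0, len(blob), size)]


sample = b"chunked decompression demo\n" * 2000
from_gzip = b"".join(decompress_auto(split(gzip.compress(sample))))
from_zlib = b"".join(decompress_auto(split(zlib.compress(sample))))
print(from_gzip == sample, from_zlib == sample)
```

The same loop works for both formats, so the sender's compression would not need to change.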

answered Oct 05 '22 05:10

dnozay

Tags: python, gzip, zlib
