How do the compression codecs work in Python?

I'm querying a database and archiving the results using Python, and I'm trying to compress the data as I write it to the log files. I'm having some problems with it, though.

My code looks like this:

import codecs

log_file = codecs.open(archive_file, 'w', 'bz2')
for id, f1, f2, f3 in cursor:
    log_file.write('%s %s %s %s\n' % (id, f1 or 'NULL', f2 or 'NULL', f3))

However, my output file has a size of 1,409,780. Running bunzip2 on the file results in a file with a size of 943,634, and running bzip2 on that results in a size of 217,275. In other words, the uncompressed file is significantly smaller than the file compressed using Python's bzip codec. Is there a way to fix this, other than running bzip2 on the command line?

I tried Python's gzip codec (changing the line to codecs.open(archive_file, 'a+', 'zip')) to see if it fixed the problem. I still get large files, but I also get a gzip: archive_file: not in gzip format error when I try to uncompress the file. What's going on there?

EDIT: I originally had the file opened in append mode, not write mode. While this may or may not be a problem, the question still holds if the file's opened in 'w' mode.

asked Sep 29 '10 17:09

Chris B.

1 Answer

As other posters have noted, the issue is that the codecs library doesn't use an incremental encoder to encode the data; instead it encodes every snippet of data fed to the write method as a compressed block. This is horribly inefficient, and just a terrible design decision for a library designed to work with streams.
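A minimal sketch of that difference, using only the standard bz2 module. The sample lines are made up, and the per-write case calls bz2.compress directly as a stand-in for what the codec does on each write, so the exact sizes will not match the real log file.

```python
import bz2

# Hypothetical sample data standing in for the rows written to the log.
lines = [("%d alpha beta gamma\n" % i).encode("ascii") for i in range(1000)]
raw = b"".join(lines)

# One complete bz2 stream per write, as described above.
per_write = b"".join(bz2.compress(line) for line in lines)

# One incremental compressor shared by every write.
compressor = bz2.BZ2Compressor()
incremental = b"".join(compressor.compress(line) for line in lines)
incremental += compressor.flush()  # finish the single stream

print("raw:", len(raw))
print("one stream per write:", len(per_write))
print("single incremental stream:", len(incremental))
```

With short lines like these, each per-write stream carries its own header and block overhead, which is why the "compressed" total can end up larger than the raw data.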

The ironic thing is that there's a perfectly reasonable incremental bz2 encoder already built into Python. It's not difficult to create a "file-like" class which does the correct thing automatically.

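import bz2

class BZ2StreamEncoder(object):
    """File-like writer that compresses all writes into one bz2 stream."""

    def __init__(self, filename, mode):
        self.log_file = open(filename, mode)
        self.encoder = bz2.BZ2Compressor()

    def write(self, data):
        # compress() buffers internally and may return an empty string.
        self.log_file.write(self.encoder.compress(data))

    def flush(self):
        # Only flush the underlying file: BZ2Compressor.flush() ends the
        # stream, and the compressor can't be used after that.
        self.log_file.flush()

    def close(self):
        # Finish the compressed stream, then close the file.
        self.log_file.write(self.encoder.flush())
        self.log_file.close()

log_file = BZ2StreamEncoder(archive_file, 'ab')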

A caveat: In this example, I've opened the file in append mode. Appending multiple compressed streams to a single file works perfectly well with bunzip2, but Python 2's bz2 module can't read such a file (there is a patch for it, and Python 3.3 and later support multi-stream input). If you need to read the compressed files you create back into Python 2, stick to a single stream per file.
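For reading a file that holds several appended streams, one possible approach is to decompress one stream at a time and continue with whatever bytes are left over. This is only a sketch, written for Python 3: decompress_multistream is a hypothetical helper, not part of the bz2 module, and it reads all the data into memory.

```python
import bz2

def decompress_multistream(data):
    """Decompress bytes holding one or more concatenated bz2 streams."""
    chunks = []
    while data:
        decompressor = bz2.BZ2Decompressor()
        chunks.append(decompressor.decompress(data))
        # Bytes after the end of the current stream start the next one.
        data = decompressor.unused_data
    return b"".join(chunks)

# Two separately compressed runs, as if the log had been appended to twice.
two_streams = bz2.compress(b"first run\n") + bz2.compress(b"second run\n")
restored = decompress_multistream(two_streams)
```

On Python 3.3 and later, bz2.decompress and bz2.BZ2File accept multi-stream input directly, so a helper like this should only be needed on older versions.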

answered Sep 28 '22 01:09

Chris B.

Tags: python, python-2.x, gzip, bzip2
