How do the compression codecs work in Python?

I'm querying a database and archiving the results using Python, and I'm trying to compress the data as I write it to the log files. I'm having some problems with it, though.

My code looks like this:

import codecs

log_file = codecs.open(archive_file, 'w', 'bz2')
for id, f1, f2, f3 in cursor:
    log_file.write('%s %s %s %s\n' % (id, f1 or 'NULL', f2 or 'NULL', f3))

However, my output file has a size of 1,409,780. Running bunzip2 on the file results in a file with a size of 943,634, and running bzip2 on that results in a size of 217,275. In other words, the uncompressed file is significantly smaller than the file compressed using Python's bzip codec. Is there a way to fix this, other than running bzip2 on the command line?

I tried Python's gzip codec (changing the line to codecs.open(archive_file, 'a+', 'zip')) to see if it fixed the problem. I still get large files, but I also get a gzip: archive_file: not in gzip format error when I try to uncompress the file. What's going on there?


EDIT: I originally had the file opened in append mode, not write mode. While this may or may not be a problem, the question still holds if the file's opened in 'w' mode.

asked Sep 29 '10 17:09 by Chris B.



1 Answer

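As other posters have noted, the issue is that the codecs library doesn't use an incremental encoder to encode the data; instead, it compresses every snippet of data fed to the write method as a separate compressed block, each carrying its own overhead. With many small writes, that overhead can exceed the size of the data itself, which is why the "compressed" file ends up larger than the original. This is horribly inefficient, and just a terrible design decision for a library designed to work with streams.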
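
The size difference is easy to reproduce. The following is a minimal sketch using made-up sample rows (not the data from the question): it compresses the same lines once per write, which is effectively what the codec does, and once through a single `bz2.BZ2Compressor`. The helper names are hypothetical.

```python
import bz2

# Hypothetical sample data standing in for the rows written to the log.
lines = [("%d alpha beta gamma\n" % i).encode("ascii") for i in range(2000)]

def compress_per_write(chunks):
    # One complete bz2 stream per chunk, mimicking the codec's behaviour.
    return b"".join(bz2.compress(chunk) for chunk in chunks)

def compress_incrementally(chunks):
    # A single compressor shared by all chunks yields one bz2 stream.
    encoder = bz2.BZ2Compressor()
    pieces = [encoder.compress(chunk) for chunk in chunks]
    pieces.append(encoder.flush())
    return b"".join(pieces)

raw = b"".join(lines)
per_write = compress_per_write(lines)
incremental = compress_incrementally(lines)

# Sizes in bytes: uncompressed, compressed per write, compressed incrementally.
print(len(raw), len(per_write), len(incremental))
```

With short lines like these, the per-write output should come out larger than the raw data, while the incremental output should be much smaller, which mirrors the pattern described in the question.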

The ironic thing is that there's a perfectly reasonable incremental bz2 encoder already built into Python. It's not difficult to create a "file-like" class which does the correct thing automatically.

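import bz2

class BZ2StreamEncoder(object):
    """File-like writer that feeds every write through one bz2 compressor."""

    def __init__(self, filename, mode):
        # Use a binary mode ('wb' or 'ab'): the compressed output is bytes.
        self.log_file = open(filename, mode)
        self.encoder = bz2.BZ2Compressor()

    def write(self, data):
        # compress() buffers internally and may return nothing for a while.
        # On Python 3, data must be bytes, so encode text before writing.
        self.log_file.write(self.encoder.compress(data))

    def flush(self):
        # Only flush the underlying file here. BZ2Compressor.flush() ends
        # the stream, and the compressor cannot be used after that call.
        self.log_file.flush()

    def close(self):
        # Finish the bz2 stream exactly once, then close the file.
        self.log_file.write(self.encoder.flush())
        self.log_file.close()

log_file = BZ2StreamEncoder(archive_file, 'ab')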

A caveat: in this example, I've opened the file in append mode. Appending multiple compressed streams to a single file works perfectly well with bunzip2, but Python itself couldn't read such files when this was written; the bz2 module only gained support for multi-stream input in Python 3.3. If you need to read the compressed files back with an older Python, stick to a single stream per file.
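
As a rough check of the multi-stream caveat, the sketch below appends two independent bz2 streams to one file and reads them back. It assumes Python 3.3 or later, where `bz2.open` exists and reads concatenated streams; the `append_stream` helper and the temporary file name are purely illustrative.

```python
import bz2
import os
import tempfile

def append_stream(path, payload):
    # Each call appends one complete, independent bz2 stream to the file.
    encoder = bz2.BZ2Compressor()
    with open(path, "ab") as fh:
        fh.write(encoder.compress(payload))
        fh.write(encoder.flush())

# Illustrative temporary location, not a path from the original post.
path = os.path.join(tempfile.mkdtemp(), "archive.bz2")
append_stream(path, b"first run\n")
append_stream(path, b"second run\n")

# On Python 3.3+, bz2.open reads both streams as one continuous file.
with bz2.open(path, "rb") as fh:
    restored = fh.read()
print(restored)
```

On older interpreters, the same read would be expected to stop after the first stream, which is the limitation the caveat describes.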

answered Sep 28 '22 01:09 by Chris B.