
Setting the gzip timestamp from Python

Tags: python, gzip

I'm interested in compressing data using Python's gzip module. It happens that I want the compressed output to be deterministic, because that's often a really convenient property for things to have in general -- if some non-gzip-aware process is going to be looking for changes in the output, say, or if the output is going to be cryptographically signed.

Unfortunately, the output is different every time. As far as I can tell, the only reason for this is the timestamp field in the gzip header, which the Python module always populates with the current time. I don't think you're actually allowed to have a gzip stream without a timestamp in it, which is too bad.

In any case, there doesn't seem to be a way for the caller of Python's gzip module to supply the correct modification time of the underlying data. (The actual gzip program seems to use the timestamp of the input file when possible.) I imagine this is because basically the only thing that ever cares about the timestamp is the gunzip command when writing to a file -- and, now, me, because I want deterministic output. Is that so much to ask?

Has anyone else encountered this problem?

What's the least terrible way to gzip some data with an arbitrary timestamp from Python?

zaphod asked Nov 05 '08 02:11


3 Answers

Yeah, you don't have any pretty options. The time is written with this line in _write_gzip_header:

write32u(self.fileobj, long(time.time()))

Since they don't give you a way to override the time, you can do one of these things:

  1. Derive a class from GzipFile, and copy the _write_gzip_header function into your derived class, but with a different value in this one line.
  2. After importing the gzip module, assign new code to its time member. You will essentially be providing a new definition of the name time in the gzip code, so you can change what time.time() means.
  3. Copy the entire gzip module, and name it my_stable_gzip, and change the line you need to.
  4. Pass a cStringIO.StringIO object (io.BytesIO in Python 3) in as fileobj, and modify the byte stream after gzip is done.
  5. Write a fake file object that keeps track of the bytes written, and passes everything through to a real file, except for the bytes for the timestamp, which you write yourself.

Here's an example of option #2 (untested):

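import gzip

class FakeTime:
    # Mimics the one function gzip calls on its time module.
    def time(self):
        return 1225856967.109

# Rebind the name "time" inside the gzip module only.
gzip.time = FakeTime()

# Now call gzip; it will think time doesn't change!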
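
A slightly more careful variant of option #2 restores the real time module afterwards, so the patch does not leak into unrelated code. This is only a sketch for Python 3: the names FixedTime, frozen_gzip_time and gzip_bytes are made up for illustration, and it assumes the gzip module still looks up time.time() through its own module-level name time, which is an implementation detail rather than a documented interface.

```python
import contextlib
import gzip
import io


class FixedTime:
    """Hypothetical stand-in exposing only the time() call gzip makes."""

    def __init__(self, timestamp):
        self._timestamp = timestamp

    def time(self):
        return self._timestamp


@contextlib.contextmanager
def frozen_gzip_time(timestamp):
    """Temporarily rebind gzip's module-level name `time` (hypothetical helper)."""
    original = gzip.time
    gzip.time = FixedTime(timestamp)
    try:
        yield
    finally:
        # Always put the real time module back, even if compression fails.
        gzip.time = original


def gzip_bytes(data):
    """Compress data in memory without passing mtime, so gzip asks for the time."""
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb", filename="") as gz:
        gz.write(data)
    return buf.getvalue()


with frozen_gzip_time(1225856967):
    first = gzip_bytes(b"Some content")
    second = gzip_bytes(b"Some content")
```

The patch is process-wide while the with block is active, so it is not safe if other threads are using gzip at the same time.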

Option #5 may be the cleanest in terms of not depending on the internals of the gzip module (untested):

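import struct

class GzipTimeFixingFile:
    def __init__(self, realfile, mtime=0):
        self.realfile = realfile
        self.mtime = mtime
        self.pos = 0

    def write(self, data):
        # gzip writes the 4-byte timestamp in one call, at offset 4 of the header.
        if self.pos == 4 and len(data) == 4:
            self.realfile.write(struct.pack("<L", self.mtime))  # Fake time goes here.
        else:
            self.realfile.write(data)
        self.pos += len(data)

    def flush(self):
        self.realfile.flush()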
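
Option #4 from the list can be sketched in a similar spirit. This assumes Python 3 and uses io.BytesIO as the in-memory file; gzip_with_mtime is a hypothetical helper name. It relies on the gzip header layout, where the timestamp is a 4-byte little-endian field at offset 4, and on the trailing CRC covering only the uncompressed data, so overwriting the header bytes should not invalidate the stream.

```python
import gzip
import io
import struct


def gzip_with_mtime(data, mtime):
    """Compress data in memory, then overwrite the header timestamp (hypothetical helper)."""
    buf = io.BytesIO()
    # filename="" keeps a file name out of the header.
    with gzip.GzipFile(fileobj=buf, mode="wb", filename="") as gz:
        gz.write(data)
    raw = bytearray(buf.getvalue())
    # Bytes 4..7 of the gzip header hold the modification time, little-endian.
    raw[4:8] = struct.pack("<L", mtime)
    return bytes(raw)


payload = b"Some content"
stamped = gzip_with_mtime(payload, 1225856967)
roundtrip = gzip.decompress(stamped)
```

The whole compressed stream is held in memory here, so this is best suited to data of modest size.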

Ned Batchelder answered Oct 21 '22 11:10


From Python 2.7 onwards you can specify the time to be used in the gzip header. N.B. filename is also included in the header and can also be specified manually.

import gzip

content = b"Some content"
# filename="" keeps the file name out of the header; mtime=0 fixes the timestamp.
with open("/tmp/f.gz", "wb") as f:
    with gzip.GzipFile(fileobj=f, mode="wb", filename="", mtime=0) as gz:
        gz.write(content)
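
One way to check that this gives deterministic output is to compress the same data twice in memory and compare the results. The following is a minimal sketch for Python 3; gzip_deterministic is a hypothetical helper name, and the comparison only says something about a single Python and zlib installation, since a different zlib version may compress differently.

```python
import gzip
import io
import struct


def gzip_deterministic(data, mtime=0):
    """Compress data in memory with a fixed header timestamp and no file name."""
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb", filename="", mtime=mtime) as gz:
        gz.write(data)
    return buf.getvalue()


first = gzip_deterministic(b"Some content")
second = gzip_deterministic(b"Some content")

# The timestamp is the 4-byte little-endian field at offset 4 of the header.
header_mtime = struct.unpack("<L", first[4:8])[0]
print(first == second, header_mtime)
```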

Dominic Bevacqua answered Oct 21 '22 10:10


Submit a patch in which the computation of the time stamp is factored out. It would almost certainly be accepted.

Alex Coventry answered Oct 21 '22 10:10