I'm interested in compressing data using Python's gzip
module. It happens that I want the compressed output to be deterministic, because that's often a really convenient property for things to have in general -- if some non-gzip-aware process is going to be looking for changes in the output, say, or if the output is going to be cryptographically signed.
Unfortunately, the output is different every time. As far as I can tell, the only reason for this is the timestamp field in the gzip header, which the Python module always populates with the current time. I don't think you're actually allowed to have a gzip stream without a timestamp in it, which is too bad.
In any case, there doesn't seem to be a way for the caller of Python's gzip
module to supply the correct modification time of the underlying data. (The actual gzip
program seems to use the timestamp of the input file when possible.) I imagine this is because basically the only thing that ever cares about the timestamp is the gunzip
command when writing to a file -- and, now, me, because I want deterministic output. Is that so much to ask?
Has anyone else encountered this problem?
What's the least terrible way to gzip
some data with an arbitrary timestamp from Python?
Yeah, you don't have any pretty options. The time is written with this line in _write_gzip_header:
write32u(self.fileobj, long(time.time()))
Since they don't give you a way to override the time, you can do one of these things:
_write_gzip_header
function into your derived class, but with a different value in this one line.Here's an example of option #2 (untested):
class FakeTime:
def time(self):
return 1225856967.109
import gzip
gzip.time = FakeTime()
# Now call gzip, it will think time doesn't change!
Option #5 may be the cleanest in terms of not depending on the internals of the gzip module (untested):
class GzipTimeFixingFile:
def __init__(self, realfile):
self.realfile = realfile
self.pos = 0
def write(self, bytes):
if self.pos == 4 and len(bytes) == 4:
self.realfile.write("XYZY") # Fake time goes here.
else:
self.realfile.write(bytes)
self.pos += len(bytes)
From Python 2.7 onwards you can specify the time to be used in the gzip header. N.B. filename is also included in the header and can also be specified manually.
import gzip
content = b"Some content"
f = open("/tmp/f.gz", "wb")
gz = gzip.GzipFile(fileobj=f,mode="wb",filename="",mtime=0)
gz.write(content)
gz.close()
f.close()
Submit a patch in which the computation of the time stamp is factored out. It would almost certainly be accepted.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With