I am trying to zip a file using the Python module gzip, and then hash the gzipped file using hashlib. I have the following code:
import hashlib
import gzip

f_name = 'read_x.fastq'

for x in range(0,3):
    file = open(f_name, 'rb')
    myzip = gzip.open('test.gz', 'wb', compresslevel=1)
    n = 100000000
    try:
        print 'zipping ' + str(x)
        for chunk in iter(lambda: file.read(n), ''):
            myzip.write(chunk)
    finally:
        file.close()
        myzip.close()

    md5 = hashlib.md5()
    print 'hashing ' + str(x)
    with open('test.gz', 'r') as f:
        for chunk in iter(lambda: f.read(n), ''):
            md5.update(chunk)
    print md5.hexdigest()
    print '\n'
which I thought should simply zip the file, hash it and display the same output hash three times in a row. However, the output I get is:
zipping 0
hashing 0
7bd80798bce074c65928e0cf9d66cae4
zipping 1
hashing 1
a3bd4e126e0a156c5d86df75baffc294
zipping 2
hashing 2
85812a39f388c388cb25a35c4fac87bf
If I leave out the gzip step, and just hash the same gzipped file three times in a row, I do indeed get the same output three times:
hashing 0
ccfddd10c8fd1140db0b218124e7e9d3
hashing 1
ccfddd10c8fd1140db0b218124e7e9d3
hashing 2
ccfddd10c8fd1140db0b218124e7e9d3
Can anyone explain what is going on here? The issue must be that the gzip process is different each time. But as far as I knew, the DEFLATE algorithm is Huffman coding followed by LZ77 (a form of run-length-encoding) or LZ77 followed by Huffman, and therefore given identical input should produce identical output.
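There are several reasons why compressing the exact same content can produce different gzip outputs. The gzip header records a modification timestamp and, by default in Python's gzip module, the base name of the output file (without the .gz suffix), so two runs at different times or with different output names yield different bytes even when the compressed data itself is identical.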
So here is a piece of code that demonstrates the wrong and the right way to get reproducible output from Python's gzip module:
import hashlib
import gzip

f_name = '/etc/passwd'
output_template = '/tmp/test{}.gz'
block_size = 4096

def digest(filename: str) -> str:
    # Hash the file in chunks so large files do not have to fit in memory.
    md5 = hashlib.md5()
    with open(filename, 'rb') as f:
        for chunk in iter(lambda: f.read(block_size), b''):
            md5.update(chunk)
    return md5.hexdigest()

print("The default way - non identical outputs")
for x in range(0, 3):
    input_handle = open(f_name, 'rb')
    output_filename = output_template.format(x)
    myzip = gzip.open(output_filename, 'wb')
    try:
        for chunk in iter(lambda: input_handle.read(block_size), b''):
            myzip.write(chunk)
    finally:
        input_handle.close()
        myzip.close()
    print(digest(output_filename))

print("The right way to get identical outputs")
for x in range(3, 6):
    input_handle = open(f_name, 'rb')
    output_filename = output_template.format(x)
    output_handle = open(output_filename, 'wb')
    myzip = gzip.GzipFile(
        filename='',  # do not emit filename into the output gzip file
        mode='wb',
        fileobj=output_handle,
        mtime=0,  # write a fixed timestamp into the header
    )
    try:
        for chunk in iter(lambda: input_handle.read(block_size), b''):
            myzip.write(chunk)
    finally:
        input_handle.close()
        myzip.close()
        output_handle.close()  # GzipFile.close() does not close fileobj
    print(digest(output_filename))
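If the data fits in memory, the same idea can probably be sketched more compactly with gzip.compress, which accepts an mtime argument on Python 3.8 and later. The helper name reproducible_gzip_md5 and the sample payload are made up for illustration; for very large inputs the streaming loop above is likely the better fit.

```python
import gzip
import hashlib


def reproducible_gzip_md5(data: bytes, level: int = 9) -> str:
    """Gzip data in memory with a fixed timestamp and return the MD5 hex digest.

    Hypothetical helper; relies on the mtime parameter of gzip.compress,
    which was added in Python 3.8.
    """
    compressed = gzip.compress(data, compresslevel=level, mtime=0)
    return hashlib.md5(compressed).hexdigest()


payload = b"ACGT" * 1000  # stand-in for real file contents
digests = {reproducible_gzip_md5(payload) for _ in range(3)}
print(digests)  # a set containing a single digest
```

Because no file name is involved and the timestamp is pinned to 0, repeated calls on the same machine should give the same digest.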
Oh, but wait... apparently gzip adds timestamp information to the header of the gzip file, so the hash would be different each time.
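To see the timestamp in the header directly, here is a small sketch that compresses the same bytes twice in memory with two different mtime values and compares the results. The helper gzip_bytes and the sample data are assumptions made for illustration; the field offsets follow the gzip file format, where bytes 4 to 7 hold the modification time.

```python
import gzip
import io
import struct


def gzip_bytes(data: bytes, mtime: int) -> bytes:
    """Compress data in memory, writing the given timestamp into the header."""
    buf = io.BytesIO()
    with gzip.GzipFile(filename="", mode="wb", fileobj=buf, mtime=mtime) as gz:
        gz.write(data)
    return buf.getvalue()


data = b"identical input " * 100
first = gzip_bytes(data, mtime=1)
second = gzip_bytes(data, mtime=2)

# Bytes 4-7 of a gzip stream hold MTIME as a little-endian 32-bit integer.
(mtime_first,) = struct.unpack("<I", first[4:8])
(mtime_second,) = struct.unpack("<I", second[4:8])
print(mtime_first, mtime_second)  # 1 2

# The two streams differ as a whole ...
print(first == second)  # False
# ... but everything outside the timestamp field is byte-for-byte the same.
print(first[:4] == second[:4] and first[8:] == second[8:])  # True
```

So the DEFLATE output itself is deterministic here; only the header metadata changes between runs, which is enough to change the hash.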