
Python md5 hashes of same gzipped file are inconsistent

I am trying to compress a file using the Python module gzip, and then hash the gzipped file using hashlib. I have the following code:

import hashlib
import gzip

f_name = 'read_x.fastq'

for x in range(0,3):

    file = open(f_name, 'rb')

    myzip = gzip.open('test.gz', 'wb', compresslevel=1)

    n = 100000000
    try:
        print 'zipping ' + str(x)
        for chunk in iter(lambda: file.read(n), ''):
            myzip.write(chunk)
    finally:
        file.close()
        myzip.close()

    md5 = hashlib.md5()
    print 'hashing ' + str(x)
    with open('test.gz', 'r') as f:
        for chunk in iter(lambda: f.read(n), ''):
            md5.update(chunk)

    print md5.hexdigest()
    print '\n'

which I thought should simply zip the file, hash it and display the same output hash three times in a row. However, the output I get is:

zipping 0
hashing 0
7bd80798bce074c65928e0cf9d66cae4


zipping 1
hashing 1
a3bd4e126e0a156c5d86df75baffc294


zipping 2
hashing 2
85812a39f388c388cb25a35c4fac87bf

If I leave out the gzip step, and just hash the same gzipped file three times in a row, I do indeed get the same output three times:

hashing 0
ccfddd10c8fd1140db0b218124e7e9d3


hashing 1
ccfddd10c8fd1140db0b218124e7e9d3


hashing 2
ccfddd10c8fd1140db0b218124e7e9d3

Can anyone explain what is going on here? The gzip step must be producing different output each time. But as far as I know, the DEFLATE algorithm is LZ77 (sliding-window matching of repeated strings) followed by Huffman coding, and therefore identical input should produce identical output.

shaw2thefloor asked Jan 29 '15 11:01


2 Answers

There are several reasons why compressing the exact same content can produce different gzip outputs:

  • The compression level. You control this with the compresslevel parameter.
  • The name of the original file, which is stored in the header. You can control this if you use the gzip.GzipFile API rather than the gzip.open API.
  • The modification time, which is also stored in the header and can also be set through the gzip.GzipFile API (the mtime parameter).
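
To see these header fields directly, here is a minimal sketch that compresses the same bytes in memory and inspects the first 10 bytes of each result. The helper names (gzip_bytes, header_mtime, has_name) and the payload are made up for illustration. The byte offsets follow the gzip header layout: the flags byte is at offset 3 (bit 0x08 means a file name follows) and the modification time is a little-endian 32-bit integer at offsets 4 to 7.

```python
import gzip
import io
import struct

def gzip_bytes(data: bytes, mtime: int, name: str = "") -> bytes:
    """Compress `data` in memory with an explicit header name and timestamp."""
    buf = io.BytesIO()
    with gzip.GzipFile(filename=name, mode="wb", fileobj=buf, mtime=mtime) as gz:
        gz.write(data)
    return buf.getvalue()

def header_mtime(blob: bytes) -> int:
    """Read the MTIME field (bytes 4..7, little-endian) from a gzip header."""
    return struct.unpack("<I", blob[4:8])[0]

def has_name(blob: bytes) -> bool:
    """Check the FNAME bit (0x08) in the flags byte at offset 3."""
    return bool(blob[3] & 0x08)

payload = b"ACGT" * 1000  # illustrative data only

a = gzip_bytes(payload, mtime=1)
b = gzip_bytes(payload, mtime=2)

print(header_mtime(a), header_mtime(b))  # the two timestamps that were passed in
print(a == b)                            # whole files differ
print(a[10:] == b[10:])                  # everything after the 10-byte header matches
print(has_name(gzip_bytes(payload, mtime=1, name="read_x.fastq")))
```

With no file name stored, the header is exactly 10 bytes, so comparing from offset 10 onward isolates the compressed data and trailer, which should be identical for identical input and compression level.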

So here is a piece of code that demonstrates the wrong and the right way to get reproducible output from Python's gzip module:

import gzip
import hashlib

f_name = '/etc/passwd'
output_template = '/tmp/test{}.gz'
block_size = 4096

def digest(filename: str) -> str:
    """Return the MD5 hex digest of the file at `filename`."""
    md5 = hashlib.md5()
    with open(filename, 'rb') as f:
        for chunk in iter(lambda: f.read(block_size), b''):
            md5.update(chunk)
    return md5.hexdigest()

print("The default way - non-identical outputs")
for x in range(0, 3):
    output_filename = output_template.format(x)
    # gzip.open stores the output file's base name and the current time in the header
    with open(f_name, 'rb') as input_handle, gzip.open(output_filename, 'wb') as myzip:
        for chunk in iter(lambda: input_handle.read(block_size), b''):
            myzip.write(chunk)
    print(digest(output_filename))

print("The right way to get identical outputs")
for x in range(3, 6):
    output_filename = output_template.format(x)
    with open(f_name, 'rb') as input_handle, open(output_filename, 'wb') as raw_output:
        with gzip.GzipFile(
            filename='',  # do not emit filename into the output gzip file
            mode='wb',
            fileobj=raw_output,
            mtime=0,  # fixed timestamp instead of the current time
        ) as myzip:
            for chunk in iter(lambda: input_handle.read(block_size), b''):
                myzip.write(chunk)
    print(digest(output_filename))

Mark Veltzer answered Nov 14 '22 22:11


Oh, but wait: apparently gzip writes a timestamp into the header of the gzip file, so the hash differs from one run to the next.
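
A minimal sketch of that effect, assuming Python 3.8 or later (where gzip.compress accepts an mtime argument). The payload and the helper name md5_of_gzip are made up for illustration. Pinning the timestamp should make repeated compressions hash identically, while changing only the timestamp changes the hash even though the data is the same.

```python
import gzip
import hashlib

def md5_of_gzip(data: bytes, mtime: int) -> str:
    """Compress `data` in memory with a fixed header timestamp and hash the result."""
    compressed = gzip.compress(data, compresslevel=1, mtime=mtime)
    return hashlib.md5(compressed).hexdigest()

payload = b"@read_1\nACGTACGT\n+\nIIIIIIII\n" * 500  # made-up FASTQ-like data

# Same data, same timestamp, three runs: collect the distinct digests.
fixed = {md5_of_gzip(payload, mtime=0) for _ in range(3)}
print(len(fixed))

# Same data, different timestamps: compare the two digests.
print(md5_of_gzip(payload, mtime=1) == md5_of_gzip(payload, mtime=2))
```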

shaw2thefloor answered Nov 14 '22 21:11