Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

how to create archive whose keep same md5 hash for identical content in python?

As explain in this article https://medium.com/@mpreziuso/is-gzip-deterministic-26c81bfd0a49 the md5 of two .tar.gz files that are the compression of the exact same set of files can be different. This is because it, for example, includes timestamp in the header of the compressed file.

In the article 3 solutions are proposed, and I would idealy like to use the first one wich is :

We can use the -n flag in gzip which will make gzip omit the timestamp and the file name from the file header;

And this solution works well:

tar -c ./bin |gzip -n >one.tar.gz
tar -c ./bin |gzip -n >two.tar.gz
md5sum one.tgz two.tgz

Nevertheless I have no idea of what will be a good way to do it in python. Is there a way to do it with tarfile(https://docs.python.org/2/library/tarfile.html)?

like image 277
Sophie Jacquin Avatar asked Jul 11 '17 13:07

Sophie Jacquin


People also ask

Can two files have same md5?

As with all such hashing algorithms, there is theoretically an unlimited number of files that will have any given MD5 hash. However, it is very unlikely that any two non-identical files in the real world will have the same MD5 hash, unless they have been specifically created to have the same hash.

Can two different strings have same md5 hash?

Generally, two files can have the same md5 hash only if their contents are exactly the same. Even a single bit of variation will generate a completely different hash value. There is one caveat, though: An md5 sum is 128 bits (16 bytes).

Does copying a file change the hash?

So, a Word file and the PDF file published from the Word file may contain the same content, but the HASH value will be different. Even copying the content from one file to another in the same software program can result in different HASH values, or even different file sizes.


1 Answers

Martin's answer is correct, but in my case I wanted to ignore the last modified date of each file in the tar as well, so that even if a file was "modified" but with no actual changes, it still has the same hash.

When creating the tar, I can override values I don't care about so they are always the same.

In this example I show that just using a normal tar.bz2, if I re-create my source file with a new creation timestamp, the hash will change (1 and 2 are the same, after re-creation, 4 will differ). However, if I set the time to Unix Epoch 0 (or any other arbitrary time), my files will all hash the same (3, 5 and 6)

To do this you need to pass a filter function to tar.add(DIR, filter=tarInfoStripFileAttrs) that removes the desired fields, as in the example below

import tarfile, time, os

def createTestFile():
    with open(DIR + "/someFile.txt", "w") as file:
        file.write("test file")

# Takes in a TarInfo and returns the modified TarInfo:
# https://docs.python.org/3/library/tarfile.html#tarinfo-objects
# intented to be passed as a filter to tarfile.add
# https://docs.python.org/3/library/tarfile.html#tarfile.TarFile.add
def tarInfoStripFileAttrs(tarInfo):
    # set time to epoch timestamp 0, aka 00:00:00 UTC on 1 January 1970
    # note that when extracting this tarfile, this time will be shown as the modified date
    tarInfo.mtime = 0

    # file permissions, probably don't want to remove this, but for some use cases you could
    # tarInfo.mode = 0

    # user/group info
    tarInfo.uid= 0
    tarInfo.uname = ''
    tarInfo.gid= 0
    tarInfo.gname = ''

    # stripping paxheaders may not be required
    # see https://stackoverflow.com/questions/34688392/paxheaders-in-tarball
    tarInfo.pax_headers = {}

    return tarInfo


# COMPRESSION_TYPE = "gz" # does not work even with filter
COMPRESSION_TYPE = "bz2"
DIR = "toTar"
if not os.path.exists(DIR):
    os.mkdir(DIR)

createTestFile()

tar1 = tarfile.open("one.tar." + COMPRESSION_TYPE, "w:" + COMPRESSION_TYPE)
tar1.add(DIR)
tar1.close()

tar2 = tarfile.open("two.tar." + COMPRESSION_TYPE, "w:" + COMPRESSION_TYPE)
tar2.add(DIR)
tar2.close()

tar3 = tarfile.open("three.tar." + COMPRESSION_TYPE, "w:" + COMPRESSION_TYPE)
tar3.add(DIR, filter=tarInfoStripFileAttrs)
tar3.close()

# Overwrite the file with the same content, but an updated time
time.sleep(1)
createTestFile()

tar4 = tarfile.open("four.tar." + COMPRESSION_TYPE, "w:" + COMPRESSION_TYPE)
tar4.add(DIR)
tar4.close()


tar5 = tarfile.open("five.tar." + COMPRESSION_TYPE, "w:" + COMPRESSION_TYPE)
tar5.add(DIR, filter=tarInfoStripFileAttrs)
tar5.close()

tar6 = tarfile.open("six.tar." + COMPRESSION_TYPE, "w:" + COMPRESSION_TYPE)
tar6.add(DIR, filter=tarInfoStripFileAttrs)
tar6.close()
$ md5sum one.tar.bz2 two.tar.bz2 three.tar.bz2 four.tar.bz2 five.tar.bz2 six.tar.bz2
0e51c97a8810e45b78baeb1677c3f946  one.tar.bz2      # same as 2
0e51c97a8810e45b78baeb1677c3f946  two.tar.bz2      # same as 1
54a38d35d48d4aa1bd68e12cf7aee511  three.tar.bz2    # same as 5/6
22cf1161897377eefaa5ba89e3fa6acd  four.tar.bz2     # would be same as 1/2, but timestamp has changed
54a38d35d48d4aa1bd68e12cf7aee511  five.tar.bz2     # same as 3, even though timestamp has changed
54a38d35d48d4aa1bd68e12cf7aee511  six.tar.bz2      # same as 3, even though timestamp has changed

You may want to tweak which params are modified and how in your filter function based on your use case.

like image 141
Patrick Fay Avatar answered Dec 04 '22 03:12

Patrick Fay