As explained in this article https://medium.com/@mpreziuso/is-gzip-deterministic-26c81bfd0a49 the MD5 of two .tar.gz files that compress the exact same set of files can differ. This is because gzip, for example, includes a timestamp in the header of the compressed file.
The article proposes 3 solutions, and I would ideally like to use the first one, which is:
We can use the -n flag in gzip which will make gzip omit the timestamp and the file name from the file header;
And this solution works well:
tar -c ./bin | gzip -n > one.tar.gz
tar -c ./bin | gzip -n > two.tar.gz
md5sum one.tar.gz two.tar.gz
Nevertheless, I have no idea what would be a good way to do this in Python. Is there a way to do it with tarfile (https://docs.python.org/2/library/tarfile.html)?
As with all such hashing algorithms, there is theoretically an unlimited number of files that will have any given MD5 hash. However, it is very unlikely that any two non-identical files in the real world will have the same MD5 hash, unless they have been specifically created to have the same hash.
Generally, two files have the same MD5 hash only if their contents are exactly the same. Even a single bit of variation will generate a completely different hash value. There is one caveat, though: an MD5 sum is only 128 bits (16 bytes), so there are far more possible files than possible hash values, and collisions necessarily exist.
So, a Word file and the PDF file published from that Word file may contain the same content, but their hash values will be different. Even copying the content from one file to another in the same software program can result in different hash values, or even different file sizes.
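As a small illustration of the point above, the following sketch hashes two byte strings that differ in a single bit. The sample data is arbitrary and chosen only for the demonstration; it uses nothing beyond the standard hashlib module.

```python
import hashlib

original = b"test file"
# Flip the lowest bit of the first byte: b"t" (0x74) becomes b"u" (0x75).
flipped = bytes([original[0] ^ 0x01]) + original[1:]

digest_a = hashlib.md5(original).hexdigest()
digest_b = hashlib.md5(flipped).hexdigest()

# Each digest is 32 hex characters (128 bits); the two values differ.
print(digest_a)
print(digest_b)
print(digest_a == digest_b)
```

The same effect is what makes a changed timestamp inside an archive header visible in the archive's hash, even when the archived file contents are identical.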
Martin's answer is correct, but in my case I also wanted to ignore the last-modified date of each file in the tar, so that a file that was "modified" without any actual change still produces the same hash.
When creating the tar, I can override the values I don't care about so that they are always the same.
The example below shows that with a normal tar.bz2, re-creating the source file with a new modification timestamp changes the hash: archives 1 and 2 are identical, but archive 4, built after the re-creation, differs. However, if I set the time to Unix epoch 0 (or any other fixed time), the archives all hash the same (3, 5 and 6).
To do this, pass a filter function that resets the desired fields to tar.add(DIR, filter=tarInfoStripFileAttrs), as in the example below.
import os
import tarfile
import time


def createTestFile():
    # (Re)writes the test file; DIR is defined further down, before the first call
    with open(DIR + "/someFile.txt", "w") as file:
        file.write("test file")


# Takes in a TarInfo and returns the modified TarInfo:
# https://docs.python.org/3/library/tarfile.html#tarinfo-objects
# intended to be passed as a filter to tarfile.add
# https://docs.python.org/3/library/tarfile.html#tarfile.TarFile.add
def tarInfoStripFileAttrs(tarInfo):
    # set time to epoch timestamp 0, aka 00:00:00 UTC on 1 January 1970
    # note that when extracting this tarfile, this time will be shown as the modified date
    tarInfo.mtime = 0

    # file permissions, probably don't want to remove this, but for some use cases you could
    # tarInfo.mode = 0

    # user/group info
    tarInfo.uid = 0
    tarInfo.uname = ''
    tarInfo.gid = 0
    tarInfo.gname = ''

    # stripping paxheaders may not be required
    # see https://stackoverflow.com/questions/34688392/paxheaders-in-tarball
    tarInfo.pax_headers = {}

    return tarInfo


# COMPRESSION_TYPE = "gz"  # does not work even with the filter: the "w:gz" mode
#                          # writes the current time and the file name into the gzip header
COMPRESSION_TYPE = "bz2"
DIR = "toTar"
if not os.path.exists(DIR):
    os.mkdir(DIR)

createTestFile()

# one and two: plain archives of the same, unchanged file
tar1 = tarfile.open("one.tar." + COMPRESSION_TYPE, "w:" + COMPRESSION_TYPE)
tar1.add(DIR)
tar1.close()

tar2 = tarfile.open("two.tar." + COMPRESSION_TYPE, "w:" + COMPRESSION_TYPE)
tar2.add(DIR)
tar2.close()

# three: same file, but with the attributes reset by the filter
tar3 = tarfile.open("three.tar." + COMPRESSION_TYPE, "w:" + COMPRESSION_TYPE)
tar3.add(DIR, filter=tarInfoStripFileAttrs)
tar3.close()

# Overwrite the file with the same content, but an updated time
time.sleep(1)
createTestFile()

# four: plain archive of the re-created file
tar4 = tarfile.open("four.tar." + COMPRESSION_TYPE, "w:" + COMPRESSION_TYPE)
tar4.add(DIR)
tar4.close()

# five and six: re-created file, attributes reset by the filter
tar5 = tarfile.open("five.tar." + COMPRESSION_TYPE, "w:" + COMPRESSION_TYPE)
tar5.add(DIR, filter=tarInfoStripFileAttrs)
tar5.close()

tar6 = tarfile.open("six.tar." + COMPRESSION_TYPE, "w:" + COMPRESSION_TYPE)
tar6.add(DIR, filter=tarInfoStripFileAttrs)
tar6.close()

$ md5sum one.tar.bz2 two.tar.bz2 three.tar.bz2 four.tar.bz2 five.tar.bz2 six.tar.bz2
0e51c97a8810e45b78baeb1677c3f946 one.tar.bz2 # same as 2
0e51c97a8810e45b78baeb1677c3f946 two.tar.bz2 # same as 1
54a38d35d48d4aa1bd68e12cf7aee511 three.tar.bz2 # same as 5/6
22cf1161897377eefaa5ba89e3fa6acd four.tar.bz2 # would be same as 1/2, but timestamp has changed
54a38d35d48d4aa1bd68e12cf7aee511 five.tar.bz2 # same as 3, even though timestamp has changed
54a38d35d48d4aa1bd68e12cf7aee511 six.tar.bz2 # same as 3, even though timestamp has changed
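If md5sum is not available, the same comparison can be done from Python. This is only a sketch: the helper names md5_of_file and group_by_md5 are made up for this illustration and are not part of tarfile or hashlib.

```python
import hashlib


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops once read() returns an empty bytes object
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_by_md5(paths):
    """Map each digest to the list of paths that share it."""
    groups = {}
    for path in paths:
        groups.setdefault(md5_of_file(path), []).append(path)
    return groups
```

Calling group_by_md5 on the six archive names should put three, five and six under one digest and one and two under another, matching the md5sum output above.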
You may want to tweak which params are modified and how in your filter function based on your use case.
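For the gz case from the question, one possible approach is to write the tar stream through gzip.GzipFile yourself, so that you control the gzip header. The sketch below assumes Python 3. The names strip_attrs and deterministic_targz are invented for this example, and pinning the tar format to GNU_FORMAT is my own choice so the output does not depend on the default of the Python version in use.

```python
import gzip
import io
import tarfile


def strip_attrs(tar_info):
    # Same idea as tarInfoStripFileAttrs: pin the fields that vary between runs.
    tar_info.mtime = 0
    tar_info.uid = tar_info.gid = 0
    tar_info.uname = tar_info.gname = ""
    return tar_info


def deterministic_targz(src_dir, arcname="."):
    """Return the bytes of a .tar.gz of src_dir that do not depend on timestamps."""
    buf = io.BytesIO()
    # filename="" and mtime=0 keep the file name and the timestamp out of the
    # gzip header, which is what `gzip -n` does on the command line.
    with gzip.GzipFile(filename="", mode="wb", fileobj=buf, mtime=0) as gz:
        # mode="w" writes an uncompressed tar stream into the gzip wrapper.
        with tarfile.open(fileobj=gz, mode="w", format=tarfile.GNU_FORMAT) as tar:
            tar.add(src_dir, arcname=arcname, filter=strip_attrs)
    return buf.getvalue()
```

Writing the returned bytes to one.tar.gz and two.tar.gz on separate runs should then give matching md5sum values, as long as the file contents, names and permissions are unchanged and the same Python and zlib versions are used.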