I am creating an application that works with files, and I was looking for ways to compute checksums for them. What is the best hashing method for calculating file checksums: MD5, SHA-1, or something else? Please feel free to be as elaborate as possible.
It depends on your use case.
If you're only worried about accidental collisions, both MD5 and SHA-1 are fine, and MD5 is generally faster. In fact, MD4 is also sufficient for most use cases, and usually even faster… but it isn't as widely implemented. (In particular, it isn't in hashlib.algorithms_guaranteed… although it should be in hashlib.algorithms_available on most stock Mac, Windows, and Linux builds.)
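If you want MD4 with a fallback, here's a quick sketch of that runtime check (falling back to MD5 is just one reasonable choice, not the only one):

import hashlib

# MD4 isn't in algorithms_guaranteed, so probe algorithms_available at
# runtime and fall back to MD5 when the local build doesn't provide it.
algorithm = 'md4' if 'md4' in hashlib.algorithms_available else 'md5'
h = hashlib.new(algorithm)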
On the other hand, if you're worried about intentional attacks, i.e., someone deliberately crafting a bogus file that matches your hash, you have to consider the value of what you're protecting. MD4 is almost definitely not sufficient, MD5 is probably not sufficient, but SHA-1 is borderline. At present, Keccak (which will soon be SHA-3) is believed to be the best bet, but you'll want to stay on top of this, because things change every year.
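An aside that postdates the timings below: newer Python versions (3.6 and later) ship the standardized SHA-3 directly in hashlib:

import hashlib

# SHA-3 (the standardized Keccak) is built into hashlib as of Python 3.6.
h = hashlib.sha3_256()
h.update(b'some data')
print(h.hexdigest())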
The Wikipedia page on Cryptographic hash function has a table that's usually updated pretty frequently. To understand the table:
Generating a collision against MD4 requires only 3 rounds, while MD5 requires about 2 million, and SHA-1 requires 15 trillion. That's enough that it would cost a few million dollars (at today's prices) to generate a collision. That may or may not be good enough for you, but it's not good enough for NIST.
Also, remember that "generally faster" isn't nearly as important as "tested faster on my data and platform". With that in mind, in 64-bit Python 3.3.0 on my Mac, I created a 1MB random bytes object, then did this:
In [173]: md4 = hashlib.new('md4')
In [174]: md5 = hashlib.new('md5')
In [175]: sha1 = hashlib.new('sha1')
In [180]: %timeit md4.update(data)
1000 loops, best of 3: 1.54 ms per loop
In [181]: %timeit md5.update(data)
100 loops, best of 3: 2.52 ms per loop
In [182]: %timeit sha1.update(data)
100 loops, best of 3: 2.94 ms per loop
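(The session above doesn't show how data was created; a simple way, assuming you just want uniformly random bytes, is os.urandom:)

import os

# 1MB of random bytes, as an assumed stand-in for the data object above.
data = os.urandom(1024 * 1024)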
As you can see, md4 is significantly faster than the others.
Tests using hashlib.md5() instead of hashlib.new('md5'), and using bytes with less entropy (runs of 1-8 string.ascii_letters separated by spaces), didn't show any significant differences.
And, for the hash algorithms that came with my installation, as tested below, nothing beat md4.
import hashlib
import timeit

for x in hashlib.algorithms_available:
    h = hashlib.new(x)
    print(x, timeit.timeit(lambda: h.update(data), number=100))
If speed is really important, there's a nice trick you can use to improve on this: use a bad, but very fast, hash function, like zlib.adler32, and only apply it to the first 256KB of each file. (For some file types, the last 256KB, or the 256KB nearest the middle without going over, etc. might be better than the first.) Then, if you find a collision, generate MD4/SHA-1/Keccak/whatever hashes on the whole file for each file.
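Here's a minimal sketch of that two-stage idea, assuming the "first 256KB" variant (quick_key is a hypothetical helper name, not something from above):

import zlib

def quick_key(path, chunk=256 * 1024):
    # Fast, non-cryptographic key over just the first 256KB of the file.
    # Matching keys are only a hint: they mean "now hash the whole file
    # with a real algorithm to confirm", so false positives are cheap.
    with open(path, 'rb') as f:
        return zlib.adler32(f.read(chunk))

Only files that share a quick_key value need the full-file cryptographic hash.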
Finally, since someone asked in a comment how to hash a file without reading the whole thing into memory:
import hashlib

def hash_file(path, algorithm='md5', bufsize=8192):
    # Read the file in bufsize chunks so it never sits in memory whole.
    h = hashlib.new(algorithm)
    with open(path, 'rb') as f:
        while True:
            block = f.read(bufsize)
            if not block:
                break
            h.update(block)
    return h.digest()
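For example, comparing two files by digest (the file names here are made up):

# Any two paths work here; these names are hypothetical.
if hash_file('a.bin') == hash_file('b.bin'):
    print('contents match, up to hash collisions')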
If squeezing out every bit of performance is important, you'll want to experiment with different values for bufsize on your platform (powers of two from 4KB to 8MB). You also might want to experiment with using raw file handles (os.open and os.read), which may sometimes be faster on some platforms.
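A sketch of that raw-handle variant, for benchmarking against the buffered version above (hash_file_raw is a hypothetical name):

import hashlib
import os

def hash_file_raw(path, algorithm='md5', bufsize=8192):
    # Same loop as hash_file, but reading through a raw OS-level file
    # descriptor instead of a buffered file object.
    h = hashlib.new(algorithm)
    fd = os.open(path, os.O_RDONLY)
    try:
        while True:
            block = os.read(fd, bufsize)
            if not block:
                break
            h.update(block)
    finally:
        os.close(fd)
    return h.digest()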