 

File Checksums in Python

I am creating an application that works with files, and I am looking for a way to compute checksums for them. What is the best hashing method for calculating file checksums: MD5, SHA-1, or something else? My criteria:

  • The checksum should be unique. I know that's theoretically impossible, but I still want the probability of collisions to be very, very small.
  • I can conclude that two files are equal if their checksums are equal.
  • Speed (not very important, but still).

Please feel free to be as elaborate as possible.

asked Jan 14 '23 by Saransh Mohapatra

1 Answer

It depends on your use case.

If you're only worried about accidental collisions, both MD5 and SHA-1 are fine, and MD5 is generally faster. In fact, MD4 is also sufficient for most use cases, and usually even faster… but it isn't as widely implemented. (In particular, it isn't in hashlib.algorithms_guaranteed… although it should be in hashlib.algorithms_available on most stock Mac, Windows, and Linux builds.)
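Because MD4 isn't guaranteed to be present, a defensive sketch would probe for it at runtime and fall back to a guaranteed algorithm (the fallback choice here is just an assumption):

```python
import hashlib

# MD4 is not in hashlib.algorithms_guaranteed, so probe for it at
# runtime; fall back to MD5, which every hashlib build provides.
try:
    h = hashlib.new('md4')
except ValueError:
    h = hashlib.md5()
h.update(b'example data')
print(h.name, h.hexdigest())
```

Both MD4 and MD5 produce 128-bit digests, so the fallback doesn't change the digest size.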

On the other hand, if you're worried about intentional attacks—i.e., someone intentionally crafting a bogus file that matches your hash—you have to consider the value of what you're protecting. MD4 is almost definitely not sufficient, MD5 is probably not sufficient, but SHA-1 is borderline. At present, Keccak (which will soon be SHA-3) is believed to be the best bet, but you'll want to stay on top of this, because things change every year.
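For reference, on newer Python versions (3.6+) the standardized SHA-3 ships directly in hashlib, so no third-party Keccak package is needed there:

```python
import hashlib

# SHA-3 (the standardized Keccak) is built into hashlib on Python 3.6+.
digest = hashlib.sha3_256(b'example data').hexdigest()
print(digest)
```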

The Wikipedia page on Cryptographic hash function has a table that's usually updated pretty frequently. To understand the table:

Generating a collision against MD4 requires only 3 rounds, while MD5 requires about 2 million, and SHA-1 requires 15 trillion. That's enough that it would cost a few million dollars (at today's prices) to generate a collision. That may or may not be good enough for you, but it's not good enough for NIST.


Also, remember that "generally faster" isn't nearly as important as "tested faster on my data and platform". With that in mind, in 64-bit Python 3.3.0 on my Mac, I created a 1MB random bytes object, then did this:

In [173]: md4 = hashlib.new('md4')
In [174]: md5 = hashlib.new('md5')
In [175]: sha1 = hashlib.new('sha1')
In [180]: %timeit md4.update(data)
1000 loops, best of 3: 1.54 ms per loop
In [181]: %timeit md5.update(data)
100 loops, best of 3: 2.52 ms per loop
In [182]: %timeit sha1.update(data)
100 loops, best of 3: 2.94 ms per loop

As you can see, md4 is significantly faster than the others.

Tests using hashlib.md5() instead of hashlib.new('md5'), and using bytes with less entropy (runs of 1–8 characters from string.ascii_letters separated by spaces), didn't show any significant differences.

And, for the hash algorithms that came with my installation, as tested below, nothing beat md4.

import hashlib
import timeit

# data is the 1MB bytes object created above
for x in hashlib.algorithms_available:
    h = hashlib.new(x)
    print(x, timeit.timeit(lambda: h.update(data), number=100))

If speed is really important, there's a nice trick you can use to improve on this: Use a bad, but very fast, hash function, like zlib.adler32, and only apply it to the first 256KB of each file. (For some file types, the last 256KB, or the 256KB nearest the middle without going over, etc. might be better than the first.) Then, if you find a collision, generate MD4/SHA-1/Keccak/whatever hashes on the whole file for each file.
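A sketch of that two-stage scheme, using hypothetical helper names (quick_key, full_digest) and the 256KB prefix described above:

```python
import hashlib
import zlib

def quick_key(path, prefix=256 * 1024):
    # Cheap first pass: adler32 over just the first 256KB.
    # Fast, but weak--only use it to group candidate duplicates.
    with open(path, 'rb') as f:
        return zlib.adler32(f.read(prefix))

def full_digest(path, algorithm='sha1', bufsize=8192):
    # Strong hash over the whole file, used only to confirm files
    # whose quick keys collided.
    h = hashlib.new(algorithm)
    with open(path, 'rb') as f:
        while True:
            block = f.read(bufsize)
            if not block:
                break
            h.update(block)
    return h.digest()
```

In use, you'd bucket files by quick_key first, then call full_digest only within buckets that contain more than one file.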


Finally, since someone asked in a comment how to hash a file without reading the whole thing into memory:

def hash_file(path, algorithm='md5', bufsize=8192):
    h = hashlib.new(algorithm)
    with open(path, 'rb') as f:
        while True:
            block = f.read(bufsize)
            if not block:
                break
            h.update(block)
    return h.digest()

If squeezing out every bit of performance is important, you'll want to experiment with different values for bufsize on your platform (powers of two from 4KB to 8MB). You also might want to experiment with raw file handles (os.open and os.read), which may be faster on some platforms.
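One way to run that bufsize experiment, as a rough sketch (the 4MB test file and the particular sizes tried are arbitrary choices):

```python
import hashlib
import os
import tempfile
import timeit

def hash_file(path, algorithm='md5', bufsize=8192):
    h = hashlib.new(algorithm)
    with open(path, 'rb') as f:
        while True:
            block = f.read(bufsize)
            if not block:
                break
            h.update(block)
    return h.digest()

# Create a throwaway 4MB file to benchmark against.
fd, path = tempfile.mkstemp()
os.write(fd, os.urandom(4 * 1024 * 1024))
os.close(fd)

# Time a few buffer sizes; larger is not always better.
for bufsize in (4096, 65536, 1 << 20, 8 << 20):
    t = timeit.timeit(lambda: hash_file(path, 'md5', bufsize), number=5)
    print(bufsize, round(t, 4))

os.remove(path)
```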

answered Jan 17 '23 by abarnert