This is in reference to "Get MD5 hash of big files in Python" and "Hashlib in Windows and Linux".
In responses to both these questions, it is advised to use larger chunks of data in the function md5.update() to improve performance.
All testing I have done appears to indicate that use of smaller chunks gives the best performance.
Consider the following code:
import hashlib
import timeit

def test(factor):
    filehash = hashlib.md5()
    blk_size_to_read = filehash.block_size * (2**factor)
    with open(largetestfile, 'rb') as f:
        read_data = f.read(blk_size_to_read)
        filehash.update(read_data)
    filehash.digest()

if __name__ == '__main__':
    for ctr in xrange(0, 12):
        funcstr = "test({})".format(str(ctr))
        timetaken = timeit.timeit(funcstr, setup="from __main__ import test", number=5000)
        print "Factor: {} Time: {}".format(str(ctr), str(timetaken))
All the tests I have done indicate that the best performance is achieved with a factor of 0 or 1 (that is, 64 or 128 bytes).
Any reason why I am seeing different results from those indicated in the questions quoted?
I have tried binary and plain text files with sizes ranging from 700 MB to 1.2 GB, and am using Python 2.7.3 on Ubuntu 12.04.
Secondary question: am I using timeit the way it should be?
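For comparison, timeit can also be given a callable directly instead of a statement string, which avoids the setup="from __main__ import test" step; a minimal sketch of that variant, assuming the same test function as above:

import timeit
from functools import partial

# Variant: pass a callable (via functools.partial) instead of a statement string.
# timeit.repeat() runs the measurement several times; taking the minimum of the
# runs is the usual way to reduce noise from other processes.
for ctr in xrange(0, 12):
    timings = timeit.repeat(partial(test, ctr), repeat=3, number=5000)
    print "Factor: {} Best time: {}".format(ctr, min(timings))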
hexdigest(): the digest is returned as a string object of double length, containing only hexadecimal digits. This may be used to exchange the value safely in email or other non-binary environments. – metatoaster
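For illustration, digest() returns the 16 raw bytes of the MD5 value while hexdigest() returns the same value as 32 hexadecimal characters; a quick check:

import hashlib

h = hashlib.md5('some data')   # Python 2 str input
print len(h.digest())          # 16  (raw bytes)
print len(h.hexdigest())       # 32  (hex characters, safe for text contexts such as email)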
Found the error! I was reading just one chunk and then doing nothing with the rest of the file!
Changed

with open(largetestfile, 'rb') as f:
    read_data = f.read(blk_size_to_read)
    filehash.update(read_data)

to

with open(largetestfile, 'rb') as f:
    while True:
        read_data = f.read(blk_size_to_read)
        if not read_data:
            break
        filehash.update(read_data)

to fix the issue.
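Putting the fix back into the test function from the question, the whole thing looks roughly like this (largetestfile is assumed to be defined elsewhere, as before):

import hashlib

def test(factor):
    filehash = hashlib.md5()
    blk_size_to_read = filehash.block_size * (2**factor)
    with open(largetestfile, 'rb') as f:
        # Keep reading until read() returns an empty string (end of file),
        # feeding every chunk to update() so the whole file is hashed.
        while True:
            read_data = f.read(blk_size_to_read)
            if not read_data:
                break
            filehash.update(read_data)
    filehash.digest()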
UPDATE:
I ran a slightly modified version of the program above to establish the best buffer size to use when incrementally calling update() to hash a given file. I also wanted to establish whether there was any benefit to incremental hashing over calculating the hash of the file in one go (other than memory constraints).
I created 20 files (with random data) for this, with file sizes starting from 4096 bytes and going up to 2.1 GB. The md5 hash of each of these files was calculated using buffer sizes starting at 2**6 bytes (64 bytes, the block size) up to 2**20 bytes. Using timeit, each of these was run 100 times and the shortest execution time was recorded. The execution time for hashing the entire file in one go was also recorded.
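The harness was roughly along the lines of the sketch below; this is only an illustration of the approach (the helper names and the example file name here are made up), the actual code is in the pastebin link at the end.

import hashlib
import timeit
from functools import partial

def hash_chunked(filename, buf_size):
    # Incremental hashing: feed the file to update() one buffer at a time.
    md5 = hashlib.md5()
    with open(filename, 'rb') as f:
        while True:
            data = f.read(buf_size)
            if not data:
                break
            md5.update(data)
    return md5.hexdigest()

def hash_whole(filename):
    # Hash the entire file contents in one go.
    with open(filename, 'rb') as f:
        return hashlib.md5(f.read()).hexdigest()

def best_time(func, runs=100):
    # Run the callable 'runs' times and keep the shortest observed time.
    return min(timeit.repeat(func, repeat=runs, number=1))

filename = 'file20.txt'                   # example file name
complete_time = best_time(partial(hash_whole, filename))
for exp in xrange(6, 21):                 # buffer sizes 2**6 .. 2**20 bytes
    buf_size = 2 ** exp
    chunked_time = best_time(partial(hash_chunked, filename, buf_size))
    print "Bufsize: {} Chunked: {} Complete: {}".format(buf_size, chunked_time, complete_time)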
The results are as follows...
FileName   Filesize (bytes)  Chunksize (bytes)  Chunked Time (s)  Complete Time (s)  %diff
file5.txt 4096 4096 0.0014789 0.0014701 -0.60%
file6.txt 8192 524288 0.0021310 0.0021060 -1.19%
file7.txt 16384 16384 0.0033200 0.0033162 -0.12%
file8.txt 32768 65536 0.0061381 0.0057440 -6.86%
file9.txt 65536 65536 0.0106990 0.0112500 4.90%
file10.txt 131072 131072 0.0203800 0.0206621 1.37%
file11.txt 262144 524288 0.0396681 0.0401120 1.11%
file12.txt 524288 1048576 0.0780780 0.0787551 0.86%
file13.txt 1048576 1048576 0.1552539 0.1564729 0.78%
file14.txt 2097152 262144 0.3101590 0.3167789 2.09%
file15.txt 4194304 65536 0.6295781 0.6477270 2.80%
file16.txt 8388608 524288 1.2633710 1.3030031 3.04%
file17.txt 16777216 524288 2.5265670 2.5925691 2.55%
file18.txt 33554432 65536 5.0558681 5.8452392 13.50%
file19.txt 67108864 65536 10.1133211 11.6993010 13.56%
file20.txt 134217728 524288 20.2226040 23.3923230 13.55%
file21.txt 268435456 65536 40.4060180 46.6972852 13.47%
file22.txt 536870912 65536 80.9403431 93.4165111 13.36%
file23.txt 1073741824 524288 161.8108051 187.1303582 13.53%
file24.txt 2147483648 65536 323.4812710 374.3899529 13.60%
The Chunked Time is the execution time when the file is broken into chunks and hashed incrementally; the Complete Time is the execution time when the entire file is hashed in one go. The %diff is the percentage difference between the Chunked Time and the Complete Time (a positive value means the chunked approach was faster).
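The %diff column appears to be computed as (Complete Time - Chunked Time) / Complete Time * 100; for example, checking the last row:

# file24.txt row from the table above
chunked_time, complete_time = 323.4812710, 374.3899529
pct_diff = (complete_time - chunked_time) / complete_time * 100
print "%.2f%%" % pct_diff    # 13.60%, matching the %diff column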
Observations:
For larger files (32 MiB (2**25 bytes) and above), there appears to be a considerable performance benefit (less time) in using the incremental approach rather than hashing the entire file in one go.
For larger files, the best chunk size appears to be 65536 (2**16) bytes.
Notes: Python 2.7.3; Ubuntu 12.04 64-bit; 8 GB of RAM. The code used for this is available here... http://pastebin.com/VxH7bL2X