Hashlib: optimal size of chunks to be used in md5.update()

This is in reference to Get MD5 hash of big files in Python and Hashlib in Windows and Linux

In responses to both these questions, it is advised to use larger chunks of data in the function md5.update() to improve performance.

All testing I have done appears to indicate that use of smaller chunks gives the best performance.

Consider the following code:

import hashlib
import timeit

# 'largetestfile' holds the path to the large test file being hashed (defined elsewhere).

def test(factor):
    filehash = hashlib.md5()
    blk_size_to_read = filehash.block_size * (2**factor)
    with open(largetestfile, 'rb') as f:
        read_data = f.read(blk_size_to_read)
        filehash.update(read_data)
    filehash.digest()

if __name__ == '__main__':
    for ctr in xrange(0, 12):
        funcstr = "test({})".format(str(ctr))
        timetaken = timeit.timeit(funcstr, setup="from __main__ import test", number = 5000)
        print "Factor: {} Time: {}".format(str(ctr), str(timetaken))

All tests I have done indicate that the best performance is achieved when using factor 0 or 1 (that is, 64 or 128 bytes).

Any reason why I am seeing different results from those indicated in the questions quoted?

I have tried binary and plain text files with sizes ranging from 700 MB to 1.2 GB, and am using Python 2.7.3 on Ubuntu 12.04.

Secondary question: am I using timeit the way it should be?
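
(As an aside on that secondary question: timeit.timeit also accepts a callable instead of a statement string, which removes the need for the setup-string import. A minimal sketch of that style, equivalent to the loop above:)

import timeit

# Passing a callable directly; no "from __main__ import test" setup is needed
# because the lambda closes over test() in this module.
for ctr in xrange(0, 12):
    timetaken = timeit.timeit(lambda: test(ctr), number=5000)
    print "Factor: {} Time: {}".format(ctr, timetaken)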

asked Jul 18 '13 by Verma

People also ask

What does Hashlib do in Python?

The Python hashlib module is an interface for hashing messages easily. It provides a common interface to many secure hash algorithms (MD5, SHA-1, SHA-256, and others). The core purpose of the module is to apply a one-way hash function to data, producing a fixed-length digest that is computationally infeasible to reverse. Note that hashing is not encryption: a hash cannot be "decrypted" back into the original message.
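
A minimal usage sketch (the input string here is arbitrary):

import hashlib

# Create a hash object, feed it data with update(), then read the digest.
h = hashlib.md5()
h.update("some data")          # in Python 2, str is already bytes
print h.hexdigest()            # 32-character hexadecimal digest

# Other algorithms in hashlib share the same interface:
print hashlib.sha256("some data").hexdigest()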

What is MD5 in Python?

MD5, defined in RFC 1321, is a hash algorithm that turns an input of arbitrary length into a fixed 128-bit (16-byte) hash value. Note: MD5 is not collision-resistant; two different inputs may produce the same hash value.

What is hexdigest()?

hexdigest() returns the digest as a string of twice the length of digest(), containing only hexadecimal digits. This may be used to exchange the value safely in email or other non-binary environments.
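
A small sketch contrasting digest() and hexdigest() (the input is arbitrary):

import hashlib

h = hashlib.md5("some data")
raw = h.digest()       # 16 raw bytes (the 128-bit MD5 value)
hexed = h.hexdigest()  # 32 hexadecimal characters, twice the length of digest()
print len(raw), len(hexed)   # prints: 16 32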


1 Answer

Found the error! I was reading just one chunk and then doing nothing!

Changed

with open(largetestfile, 'rb') as f:
    read_data = f.read(blk_size_to_read)
    filehash.update(read_data)

to

with open(largetestfile, 'rb') as f:
    while True:
        read_data = f.read(blk_size_to_read)
        if not read_data:   # empty string means end of file
            break
        filehash.update(read_data)

to fix the issue.
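
The same loop can also be written with iter() and an empty-string sentinel, which avoids the explicit break; a sketch assuming the same filehash and blk_size_to_read as above:

with open(largetestfile, 'rb') as f:
    # iter() keeps calling f.read(blk_size_to_read) until it returns '' (end of file).
    for read_data in iter(lambda: f.read(blk_size_to_read), ''):
        filehash.update(read_data)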

UPDATE:

I ran a slightly modified version of the program above to establish the best possible size of buffer to be used when incrementally using update() to find the hash of a given file. I also wanted to establish whether there was any benefit in incremental hashing rather than calculating the hash of the file in one go (other than memory constraints).

I created 20 files (with random data) for this, with file sizes starting from 4096 bytes and going up to 2.1 GB. The md5 hash of each file was calculated using buffer sizes from 2**6 bytes (64 bytes, the block size) up to 2**20 bytes. Using timeit, each combination was run 100 times and the shortest execution time was recorded. The execution time for hashing the entire file in one go was also recorded.
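
The full script is linked in the notes below; the following is only a condensed sketch of the measurement approach, with a placeholder file name and helper names invented here:

import hashlib
import timeit

def md5_chunked(path, chunk_size):
    # Hash the file incrementally, chunk_size bytes at a time.
    md5 = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), ''):
            md5.update(chunk)
    return md5.hexdigest()

def md5_whole(path):
    # Hash the entire file contents in one go.
    with open(path, 'rb') as f:
        return hashlib.md5(f.read()).hexdigest()

path = 'file20.txt'   # placeholder: one of the generated test files
for exp in xrange(6, 21):
    chunk_size = 2 ** exp
    best = min(timeit.repeat(lambda: md5_chunked(path, chunk_size),
                             repeat=100, number=1))
    print "chunk {:>7}: {:.7f}".format(chunk_size, best)
best_whole = min(timeit.repeat(lambda: md5_whole(path), repeat=100, number=1))
print "whole file  : {:.7f}".format(best_whole)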

The results are as follows...

FileName           Filesize       Chunksize      Chunked Time   Complete Time       %diff
file5.txt                 4096           4096      0.0014789      0.0014701         -0.60%
file6.txt                 8192         524288      0.0021310      0.0021060         -1.19%
file7.txt                16384          16384      0.0033200      0.0033162         -0.12%
file8.txt                32768          65536      0.0061381      0.0057440         -6.86%
file9.txt                65536          65536      0.0106990      0.0112500          4.90%
file10.txt              131072         131072      0.0203800      0.0206621          1.37%
file11.txt              262144         524288      0.0396681      0.0401120          1.11%
file12.txt              524288        1048576      0.0780780      0.0787551          0.86%
file13.txt             1048576        1048576      0.1552539      0.1564729          0.78%
file14.txt             2097152         262144      0.3101590      0.3167789          2.09%
file15.txt             4194304          65536      0.6295781      0.6477270          2.80%
file16.txt             8388608         524288      1.2633710      1.3030031          3.04%
file17.txt            16777216         524288      2.5265670      2.5925691          2.55%
file18.txt            33554432          65536      5.0558681      5.8452392         13.50%
file19.txt            67108864          65536     10.1133211     11.6993010         13.56%
file20.txt           134217728         524288     20.2226040     23.3923230         13.55%
file21.txt           268435456          65536     40.4060180     46.6972852         13.47%
file22.txt           536870912          65536     80.9403431     93.4165111         13.36%
file23.txt          1073741824         524288    161.8108051    187.1303582         13.53%
file24.txt          2147483648          65536    323.4812710    374.3899529         13.60%

The Chunked Time is the execution time when the file is broken into chunks and hashed incrementally; the Complete Time is the execution time when the entire file is hashed in one go. The %diff is the percentage difference between Chunked Time and Complete Time.

Observations:

  1. For smaller file sizes the chunk size is nearly always equal to the file size, and there appears to be no advantage in adopting either approach.
  2. For larger files (33554432 (2**25) bytes and above), there appears to be a considerable performance benefit (less time) in using the incremental approach rather than hashing the entire file in one go.
  3. For larger files, the best chunk/buffer size is 65536 (2**16) bytes.

Notes: Python 2.7.3; Ubuntu 12.04 64-bit; 8 GB of RAM. The code used for this is available here: http://pastebin.com/VxH7bL2X

answered by Verma