This is in reference to "Get MD5 hash of big files in Python" and "Hashlib in Windows and Linux".
In responses to both these questions, it is advised to use larger chunks of data in the function md5.update() to improve performance.
All testing I have done appears to indicate that use of smaller chunks gives the best performance.
Consider the following code:
import hashlib
import timeit

def test(factor):
    filehash = hashlib.md5()
    blk_size_to_read = filehash.block_size * (2**factor)
    with open(largetestfile, 'rb') as f:
        read_data = f.read(blk_size_to_read)
        filehash.update(read_data)
    filehash.digest()

if __name__ == '__main__':
    for ctr in xrange(0, 12):
        funcstr = "test({})".format(str(ctr))
        timetaken = timeit.timeit(funcstr, setup="from __main__ import test", number=5000)
        print "Factor: {} Time: {}".format(str(ctr), str(timetaken))
All the tests I have done indicate that the best performance is achieved with a factor of 0 or 1 (that is, 64 or 128 bytes).
Any reason why I am seeing different results from those indicated in the questions quoted?
I have tried binary and plain text files with sizes ranging from 700 MB to 1.2 GB, and am using Python 2.7.3 on Ubuntu 12.04.
Secondary question: am I using timeit the way it should be?
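For comparison, timeit can also be given a callable directly instead of a statement string, which avoids the setup="from __main__ import test" step; a minimal sketch of that variant, assuming the same test function as above:

import timeit
from functools import partial

# Variant: pass a callable (via functools.partial) instead of a statement string.
# timeit.repeat() runs the measurement several times; taking the minimum of the
# runs is the usual way to reduce noise from other processes.
for ctr in xrange(0, 12):
    timings = timeit.repeat(partial(test, ctr), repeat=3, number=5000)
    print "Factor: {} Best time: {}".format(ctr, min(timings))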
hexdigest(): the digest is returned as a string object of double length, containing only hexadecimal digits. This may be used to exchange the value safely in email or other non-binary environments. – metatoaster
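For illustration, digest() returns the 16 raw bytes of the MD5 value while hexdigest() returns the same value as 32 hexadecimal characters; a quick check:

import hashlib

h = hashlib.md5('some data')   # Python 2 str input
print len(h.digest())          # 16  (raw bytes)
print len(h.hexdigest())       # 32  (hex characters, safe for text contexts such as email)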
Found the error! I was reading just one chunk and then doing nothing with the rest of the file!
Changed

with open(largetestfile, 'rb') as f:
    read_data = f.read(blk_size_to_read)
    filehash.update(read_data)

to

with open(largetestfile, 'rb') as f:
    while True:
        read_data = f.read(blk_size_to_read)
        if not read_data:
            break
        filehash.update(read_data)

to fix the issue.
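Putting the fix back into the test function from the question, the whole thing looks roughly like this (largetestfile is assumed to be defined elsewhere, as before):

import hashlib

def test(factor):
    filehash = hashlib.md5()
    blk_size_to_read = filehash.block_size * (2**factor)
    with open(largetestfile, 'rb') as f:
        # Keep reading until read() returns an empty string (end of file),
        # feeding every chunk to update() so the whole file is hashed.
        while True:
            read_data = f.read(blk_size_to_read)
            if not read_data:
                break
            filehash.update(read_data)
    filehash.digest()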
UPDATE:
I ran a slightly modified version of the program above to establish the best buffer size to use when incrementally calling update() to hash a given file. I also wanted to establish whether there was any benefit to incremental hashing over calculating the hash of the file in one go (other than memory constraints).
I created 20 files (with random data) for this, with file sizes starting from 4096 bytes and going up to 2.1 GB. The md5 hash of each of these files was calculated using buffer sizes starting at 2**6 bytes (64 bytes, the block size) up to 2**20 bytes. Using timeit, each of these was run 100 times and the shortest execution time was recorded. The execution time for hashing the entire file in one go was also recorded.
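The harness was roughly along the lines of the sketch below; this is only an illustration of the approach (the helper names and the example file name here are made up), the actual code is in the pastebin link at the end.

import hashlib
import timeit
from functools import partial

def hash_chunked(filename, buf_size):
    # Incremental hashing: feed the file to update() one buffer at a time.
    md5 = hashlib.md5()
    with open(filename, 'rb') as f:
        while True:
            data = f.read(buf_size)
            if not data:
                break
            md5.update(data)
    return md5.hexdigest()

def hash_whole(filename):
    # Hash the entire file contents in one go.
    with open(filename, 'rb') as f:
        return hashlib.md5(f.read()).hexdigest()

def best_time(func, runs=100):
    # Run the callable 'runs' times and keep the shortest observed time.
    return min(timeit.repeat(func, repeat=runs, number=1))

filename = 'file20.txt'                   # example file name
complete_time = best_time(partial(hash_whole, filename))
for exp in xrange(6, 21):                 # buffer sizes 2**6 .. 2**20 bytes
    buf_size = 2 ** exp
    chunked_time = best_time(partial(hash_chunked, filename, buf_size))
    print "Bufsize: {} Chunked: {} Complete: {}".format(buf_size, chunked_time, complete_time)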
The results are as follows...
FileName   Filesize (bytes)  Chunksize (bytes)  Chunked Time (s)  Complete Time (s)  %diff
file5.txt 4096 4096 0.0014789 0.0014701 -0.60%
file6.txt 8192 524288 0.0021310 0.0021060 -1.19%
file7.txt 16384 16384 0.0033200 0.0033162 -0.12%
file8.txt 32768 65536 0.0061381 0.0057440 -6.86%
file9.txt 65536 65536 0.0106990 0.0112500 4.90%
file10.txt 131072 131072 0.0203800 0.0206621 1.37%
file11.txt 262144 524288 0.0396681 0.0401120 1.11%
file12.txt 524288 1048576 0.0780780 0.0787551 0.86%
file13.txt 1048576 1048576 0.1552539 0.1564729 0.78%
file14.txt 2097152 262144 0.3101590 0.3167789 2.09%
file15.txt 4194304 65536 0.6295781 0.6477270 2.80%
file16.txt 8388608 524288 1.2633710 1.3030031 3.04%
file17.txt 16777216 524288 2.5265670 2.5925691 2.55%
file18.txt 33554432 65536 5.0558681 5.8452392 13.50%
file19.txt 67108864 65536 10.1133211 11.6993010 13.56%
file20.txt 134217728 524288 20.2226040 23.3923230 13.55%
file21.txt 268435456 65536 40.4060180 46.6972852 13.47%
file22.txt 536870912 65536 80.9403431 93.4165111 13.36%
file23.txt 1073741824 524288 161.8108051 187.1303582 13.53%
file24.txt 2147483648 65536 323.4812710 374.3899529 13.60%
The Chunked Time is the execution time when the file is broken into chunks and hashed incrementally; the Complete Time is the execution time when the entire file is hashed in one go. The %diff is the percentage difference between the Chunked Time and the Complete Time (a positive value means the chunked approach was faster).
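The %diff column appears to be computed as (Complete Time - Chunked Time) / Complete Time * 100; for example, checking the last row:

# file24.txt row from the table above
chunked_time, complete_time = 323.4812710, 374.3899529
pct_diff = (complete_time - chunked_time) / complete_time * 100
print "%.2f%%" % pct_diff    # 13.60%, matching the %diff column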
Observations:
For larger files (32 MiB (2**25 bytes) and above), there appears to be a considerable performance benefit (less time) in using the incremental approach rather than hashing the entire file in one go.
For larger files, the best chunk size appears to be 65536 (2**16) bytes.
Notes: Python 2.7.3; Ubuntu 12.04 64-bit; 8 GB of RAM. The code used for this is available here... http://pastebin.com/VxH7bL2X