Is there a faster way (than this) to calculate the hash of a file (using hashlib) in Python?

Tags: python, hashlib

My current approach is this:

import hashlib

def get_hash(path=PATH, hash_type='md5'):
    func = getattr(hashlib, hash_type)()
    with open(path, 'rb') as f:
        # Read the file in chunks of 1024 * block_size bytes until EOF.
        for block in iter(lambda: f.read(1024*func.block_size), b''):
            func.update(block)
    return func.hexdigest()
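
A minimal timing harness (just a sketch; the path below is a placeholder, not my actual file) looks like this:

    import time

    test_path = '/path/to/some.iso'  # placeholder; any large file works

    start = time.perf_counter()
    print(get_hash(test_path, 'md5'))
    print("{0:.2f} s".format(time.perf_counter() - start))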

It takes about 3.5 seconds to calculate the MD5 sum of an 842 MB ISO file on an i5 @ 1.7 GHz. I have tried different ways of reading the file, but all of them yield slower results. Is there, perhaps, a faster solution?

EDIT: I replaced 2**16 (inside the f.read()) with 1024*func.block_size, since the default block_size for most hash functions supported by hashlib is 64 bytes (except for 'sha384' and 'sha512', whose block_size is 128 bytes). So the chunk size is still the same (65536 bytes).
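
The block_size values are easy to check directly (a quick illustration; values are in bytes):

    import hashlib

    # block_size is the hash's internal block size in bytes, not bits:
    print(hashlib.md5().block_size)     # 64  -> 1024 * 64  = 65536-byte reads
    print(hashlib.sha512().block_size)  # 128 -> 1024 * 128 = 131072-byte reads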

EDIT(2): I did something wrong. It takes 8.4 seconds instead of 3.5. :(

EDIT(3): Apparently Windows was using the disk at over 80% when I ran the function again. It really does take 3.5 seconds. Phew.

Another solution (about 0.5 s faster) is to use os.open():

import os

def get_hash(path=PATH, hash_type='md5'):
    func = getattr(hashlib, hash_type)()
    # Note: os.O_BINARY exists only on Windows.
    f = os.open(path, (os.O_RDWR | os.O_BINARY))
    for block in iter(lambda: os.read(f, 2048*func.block_size), b''):
        func.update(block)
    os.close(f)
    return func.hexdigest()
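
Since os.O_BINARY exists only on Windows and read-only access is enough for hashing, a more portable variant of the same idea (a sketch only, not timed) would be:

    import hashlib
    import os

    def get_hash_portable(path, hash_type='md5'):
        func = getattr(hashlib, hash_type)()
        # O_BINARY is Windows-only; fall back to 0 elsewhere. Read-only access suffices.
        flags = os.O_RDONLY | getattr(os, 'O_BINARY', 0)
        fd = os.open(path, flags)
        try:
            for block in iter(lambda: os.read(fd, 2048 * func.block_size), b''):
                func.update(block)
        finally:
            os.close(fd)
        return func.hexdigest()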

Note that these results are not final.

Asked Oct 20 '22 by Deneb


1 Answer

Using an 874 MiB random data file, which required 2 seconds with the openssl md5 tool, I was able to improve speed as follows.

  • Using your first method required 21 seconds.
  • Reading the entire file (21 seconds) into a buffer and then updating required 2 seconds.
  • Using the following function with a buffer size of 8096 required 17 seconds.
  • Using the following function with a buffer size of 32767 required 11 seconds.
  • Using the following function with a buffer size of 65536 required 8 seconds.
  • Using the following function with a buffer size of 131072 required 8 seconds.
  • Using the following function with a buffer size of 1048576 required 12 seconds.

import hashlib
import time

def md5_speedcheck(path, size):
    pts = time.process_time()
    ats = time.time()
    m = hashlib.md5()
    with open(path, 'rb') as f:
        # Hash the file in chunks of 'size' bytes.
        b = f.read(size)
        while len(b) > 0:
            m.update(b)
            b = f.read(size)
    # Report CPU time and wall-clock time separately.
    print("{0:.3f} s (process time)".format(time.process_time() - pts))
    print("{0:.3f} s (wall time)".format(time.time() - ats))
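
A driver along these lines (sketch only; the path is a placeholder) sweeps the buffer sizes listed above:

    if __name__ == '__main__':
        test_path = '/path/to/random.dat'  # placeholder for the 874 MiB test file
        for size in (8096, 32767, 65536, 131072, 1048576):
            print("buffer size:", size)
            md5_speedcheck(test_path, size)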

The times I noted above are wall-clock ("human") time; the processor time is about the same for all of them, with the difference spent blocked on I/O.

The key is to use a buffer size big enough to mitigate disk latency, but small enough to avoid VM page swapping. For my particular machine it appears that 64 KiB is about optimal.

Answered Oct 23 '22 by Lance Helsten