Is this "fast hash" function dangerous?

Tags: python, hash

I need a smart copy function for reliable and fast file copying & linking. The files are very large (from some gigabytes to over 200GB) and distributed over a lot of folders with people renaming files and maybe folders during the day, so I want to use hashes to see if I've copied a file already, maybe under a different name, and only create a link in that case.

I'm completely new to hashing and I'm using this function here to hash:

import hashlib

def calculate_sha256(file_path, chunk_size=2 ** 10):
    '''
    Calculate the SHA-256 of the first 16 chunks of a given file.

    @param file_path: The file_path including the file name.
    @param chunk_size: The chunk size to allow reading of large files.
    @return SHA-256 hex digest of the bytes that were read.
    '''
    sha256 = hashlib.sha256()
    with open(file_path, mode="rb") as f:
        # Read at most 16 chunks: 16 KiB with the default chunk_size,
        # 16 MiB with chunk_size=2 ** 20.
        for _ in range(16):
            chunk = f.read(chunk_size)
            if not chunk:
                # End of file reached before 16 chunks were read.
                break
            sha256.update(chunk)
    return sha256.hexdigest()

Hashing an entire 3GB file takes about one minute, so in the end, the process might be very slow for a 16TB HD.
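For a rough sense of scale, here is a back-of-the-envelope sketch of that estimate. It assumes the one-minute-per-3GB rate scales linearly and that GB and TB are decimal units; both are assumptions, and the function name is only illustrative.

```python
def estimated_hours(total_bytes, sample_bytes=3e9, sample_minutes=1.0):
    """Extrapolate hashing time linearly from one measured sample (assumption)."""
    minutes = total_bytes / sample_bytes * sample_minutes
    return minutes / 60


# 16 TB at one minute per 3 GB
print(round(estimated_hours(16e12), 1))  # → 88.9
```

Under these assumptions a full pass over 16TB would take roughly 89 hours, i.e. between three and four days.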

Now my idea is to use some additional knowledge about the files' internal structure to speed things up: I know they contain a small header, then a lot of measurement data, and I know they contain real-time timestamps, so I'm quite sure that the chance that, let's say, the first 16MB of two files are identical, is very low (for that to happen, two files would need to be created at exactly the same time under exactly the same environmental conditions). So my conclusion is that it should be enough to hash only the first X MB of each file.
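A minimal sketch of this idea, assuming the file size is included in the key alongside the hash of the first bytes. The name `quick_fingerprint` and the `prefix_bytes` parameter are hypothetical, chosen only for illustration.

```python
import hashlib
import os


def quick_fingerprint(file_path, prefix_bytes=16 * 2 ** 20, chunk_size=2 ** 20):
    """Return (file size, SHA-256 hex digest of the first prefix_bytes bytes)."""
    sha256 = hashlib.sha256()
    remaining = prefix_bytes
    with open(file_path, "rb") as f:
        while remaining > 0:
            # Never read past the requested prefix length.
            chunk = f.read(min(chunk_size, remaining))
            if not chunk:
                break  # file is shorter than prefix_bytes
            sha256.update(chunk)
            remaining -= len(chunk)
    return os.path.getsize(file_path), sha256.hexdigest()
```

Two files with the same size and the same first `prefix_bytes` bytes still get the same fingerprint even if they differ later on (for example, a truncated or corrupted copy that was padded to the same length), so a match could be confirmed with a full hash before a copy is replaced by a link.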

It works on my example data, but as I'm inexperienced I just wanted to ask if there is something I'm not aware of (a hidden danger or a better way to do it).

Thank you very much!

Blutkoete asked Oct 20 '22 06:10


1 Answer

You can get the MD5 hash of a large file by reading it in small chunks and feeding each chunk to the hash object, so the whole file never has to fit in memory.

Also, calculating MD5 hashes is typically faster than calculating SHA-256 hashes, so MD5 can be favored for performance reasons in any application that doesn't rely on the hash for security purposes.
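A minimal sketch of chunked MD5 hashing with the standard `hashlib` module. The function name `calculate_md5` and the 1 MiB default chunk size are illustrative assumptions, not tuned values.

```python
import hashlib


def calculate_md5(file_path, chunk_size=2 ** 20):
    """Return the MD5 hex digest of a whole file, read in chunk_size pieces."""
    md5 = hashlib.md5()
    with open(file_path, "rb") as f:
        # iter() with a sentinel keeps calling f.read until it returns b"" (EOF).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()
```

Because the hash object is updated incrementally, the result is the same as hashing the whole file content in one call, regardless of the chunk size.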

Alex W answered Oct 23 '22 07:10