Get MD5 hash of big files in Python

You need to read the file in chunks of suitable size:

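import hashlib

def md5_for_file(f, block_size=2**20):
    # f must be a file object opened in binary mode ('rb')
    md5 = hashlib.md5()
    while True:
        data = f.read(block_size)  # read at most block_size bytes (1 MiB by default)
        if not data:               # an empty bytes object means end of file
            break
        md5.update(data)
    return md5.digest()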

NOTE: Make sure you open the file in binary mode, i.e. pass 'rb' to open(), otherwise you will get the wrong result.
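To see why the mode matters, here is a small sketch for Python 3 (the temporary file and the helper names `md5_binary` and `md5_text` are made up for illustration). With default settings, text mode translates `\r\n` to `\n` while reading, so the bytes that reach MD5 no longer match what is on disk:

```python
import hashlib
import os
import tempfile

def md5_binary(path):
    # Binary mode: the bytes are hashed exactly as stored on disk.
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

def md5_text(path):
    # Text mode, shown only as a counter-example: universal-newline
    # handling rewrites "\r\n" as "\n" before the data reaches us.
    with open(path, "r", encoding="latin-1") as f:
        return hashlib.md5(f.read().encode("latin-1")).hexdigest()

# Hypothetical sample file containing Windows-style line endings.
fd, sample_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"first line\r\nsecond line\r\n")

print(md5_binary(sample_path) == md5_text(sample_path))  # False
os.remove(sample_path)
```

Only the binary-mode result agrees with external tools, because those hash the raw bytes of the file.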

To do the whole lot in one function, use something like:

import hashlib
import os

def generate_file_md5(rootdir, filename, blocksize=2**20):
    m = hashlib.md5()
    with open(os.path.join(rootdir, filename), "rb") as f:
        while True:
            buf = f.read(blocksize)
            if not buf:
                break
            m.update(buf)
    return m.hexdigest()

The update above is based on the comments provided by Frerich Raabe. I tested it and found it to be correct on my Python 2.7.2 Windows installation.

I cross-checked the results using the 'jacksum' tool.

jacksum -a md5 <filename>

http://www.jonelo.de/java/jacksum/

Break the file into 8192-byte chunks (or some other multiple of 64 bytes) and feed them to MD5 consecutively using update().

This takes advantage of the fact that MD5 processes its input in 64-byte blocks (8192 is 64×128). Since you're not reading the entire file into memory, this won't use much more than 8192 bytes of memory.
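hashlib reports MD5's internal block size and digest size directly, and the chunk size you pick only affects speed and memory use, never the resulting hash. A minimal sketch (the helper name `md5_chunked` and the in-memory sample data are illustrative, not part of the answer):

```python
import hashlib
import io

def md5_chunked(stream, chunk_size):
    # Feed the stream to MD5 in pieces of at most chunk_size bytes.
    h = hashlib.md5()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        h.update(chunk)
    return h.hexdigest()

print(hashlib.md5().block_size)   # 64 (bytes per internal block)
print(hashlib.md5().digest_size)  # 16 (bytes, i.e. a 128-bit digest)

data = b"x" * 100_000  # stand-in for the contents of a large file
one_shot = hashlib.md5(data).hexdigest()
# Any chunk size gives the same hash, even one that is not a multiple of 64.
for size in (1000, 8192, 2**20):
    assert md5_chunked(io.BytesIO(data), size) == one_shot
```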

In Python 3.8+ you can do

import hashlib
with open("your_filename.txt", "rb") as f:
    file_hash = hashlib.md5()
    while chunk := f.read(8192):
        file_hash.update(chunk)
print(file_hash.digest())
print(file_hash.hexdigest())  # to get a printable str instead of bytes
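On Python 3.11 and later, the standard library also offers `hashlib.file_digest()`, which does the chunked reading internally. A sketch, assuming you may still need to run on older versions (the wrapper name `file_md5` and the fallback branch are my own additions):

```python
import hashlib

def file_md5(path, chunk_size=8192):
    with open(path, "rb") as f:
        if hasattr(hashlib, "file_digest"):
            # Python 3.11+: hashlib reads the file in chunks for us.
            return hashlib.file_digest(f, "md5").hexdigest()
        # Older versions: fall back to a manual chunked loop.
        h = hashlib.md5()
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
        return h.hexdigest()
```

Either branch returns the same hex string, so callers do not need to know which Python version they are running on.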

Below I've incorporated the suggestions from the comments. Thank you all!

Python 3.7 and below

import hashlib

def checksum(filename, hash_factory=hashlib.md5, chunk_num_blocks=128):
    h = hash_factory()
    with open(filename,'rb') as f: 
        for chunk in iter(lambda: f.read(chunk_num_blocks*h.block_size), b''): 
            h.update(chunk)
    return h.digest()

Python 3.8 and above

import hashlib

def checksum(filename, hash_factory=hashlib.md5, chunk_num_blocks=128):
    h = hash_factory()
    with open(filename,'rb') as f: 
        while chunk := f.read(chunk_num_blocks*h.block_size): 
            h.update(chunk)
    return h.digest()

Original post

If you want a more Pythonic (no while True) way of reading the file check this code:

import hashlib

def checksum_md5(filename):
    md5 = hashlib.md5()
    with open(filename,'rb') as f: 
        for chunk in iter(lambda: f.read(8192), b''): 
            md5.update(chunk)
    return md5.digest()

Note that iter() needs an empty byte string as its sentinel for the returned iterator to halt at EOF, since read() on a file opened in binary mode returns b'' (not '').
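The two-argument form of iter() is easy to try on an in-memory stream. In this sketch the `io.BytesIO` object stands in for a file opened with 'rb', and the 4-byte chunk size is chosen only to keep the output short:

```python
import io

stream = io.BytesIO(b"abcdefghij")  # in-memory stand-in for a binary file
# iter(callable, sentinel) keeps calling read(4) until it returns b''.
chunks = list(iter(lambda: stream.read(4), b""))
print(chunks)  # [b'abcd', b'efgh', b'ij']
```

In Python 3, b'' never compares equal to '', so passing '' as the sentinel would leave the loop running forever once read() starts returning b''.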

Here's my version of @Piotr Czapla's method:

import hashlib

def md5sum(filename):
    md5 = hashlib.md5()
    with open(filename, 'rb') as f:
        # read 128 MD5 blocks (128 * 64 = 8192 bytes) at a time
        for chunk in iter(lambda: f.read(128 * md5.block_size), b''):
            md5.update(chunk)
    return md5.hexdigest()

Using several comments and answers from this thread, here is my solution:

import hashlib
def md5_for_file(path, block_size=256*128, hr=False):
    '''
    The best block size depends directly on the block size of your
    filesystem, so choose it accordingly to avoid performance issues.
    Here I have blocks of 4096 bytes (the NTFS default).
    '''
    md5 = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(block_size), b''):
            md5.update(chunk)
    if hr:
        return md5.hexdigest()
    return md5.digest()

- It is "pythonic".
- It is a function.
- It avoids implicit values: always prefer explicit ones.
- It allows (very important) performance optimizations.

And finally:

- It has been built by a community. Thanks, all, for your advice and ideas.
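The `hr` flag in the function above only changes how the result is represented: digest() gives the raw bytes and hexdigest() gives the same value as a printable string. A quick sketch on an in-memory value (the input b"hello" is just an example):

```python
import hashlib

raw = hashlib.md5(b"hello").digest()      # what hr=False returns: 16 raw bytes
text = hashlib.md5(b"hello").hexdigest()  # what hr=True returns: 32 hex characters

print(len(raw), len(text))  # 16 32
print(raw.hex() == text)    # True
print(text)                 # 5d41402abc4b2a76b9719d911017c592
```

Use the hex form when comparing against checksums published on a website or printed by command-line tools.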
