I am trying to generate a hash for a given file. In this case the hash function reached a binary file (a .tgz file) and raised an error. Is there a way to read a binary file and generate an MD5 hash of it?
The error I am receiving is:
    buffer = buffer.decode('UTF-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbc in position 10: invalid start byte
The source code is:
import hashlib
def HashFile(filename, readBlockSize = 4096):
    hash = hashlib.md5()
    with open(filename, 'rb') as fileHandle:
        while True:
            buffer = fileHandle.read(readBlockSize)
            if not buffer:
                break
            buffer = buffer.decode('UTF-8')                
            hash.update(hashlib.md5(buffer).hexdigest())
    return
I am using Python 3.7 on Linux.
There are a few things you can tweak here.
You don't need to decode the bytes returned by .read(), because md5() is expecting bytes in the first place, not str:
>>> import hashlib
>>> h = hashlib.md5(open('dump.rdb', 'rb').read()).hexdigest()
>>> h
'9a7bf9d3fd725e8b26eee3c31025b18e'
This means you can remove the line buffer = buffer.decode('UTF-8') from your function.
You'll also need to return a value if you want to use the result of the function; a bare return gives back None. Returning hash.hexdigest() gives you the digest as a hex string.
Lastly, you need to pass the raw block of bytes to .update(), not its hex digest (which is a str); see the docs' example.
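To illustrate that last point, here is a minimal sketch using made-up in-memory data rather than a real file. It shows that feeding raw chunks to .update() one at a time gives the same digest as hashing all the bytes in a single call, and that passing a str is rejected. The names data, whole and chunked are only for illustration.

```python
import hashlib

# Arbitrary binary data; it contains bytes that are not valid UTF-8.
data = bytes(range(256)) * 64

# Hash everything in one call.
whole = hashlib.md5(data).hexdigest()

# Hash the same data in 4096-byte chunks.
chunked = hashlib.md5()
for i in range(0, len(data), 4096):
    chunked.update(data[i:i + 4096])

# Passing a str instead of bytes raises TypeError.
try:
    hashlib.md5().update("not bytes")
    str_rejected = False
except TypeError:
    str_rejected = True

print(chunked.hexdigest() == whole, str_rejected)
```

Because the chunked result matches the one-shot result, the block size only affects memory use and speed, not the digest.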
Putting it all together:
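import hashlib

def hash_file(filename: str, blocksize: int = 4096) -> str:
    hsh = hashlib.md5()
    with open(filename, "rb") as f:  # binary mode: read() returns bytes
        while True:
            buf = f.read(blocksize)
            if not buf:  # empty bytes means end of file
                break
            hsh.update(buf)  # feed the raw bytes, no decoding
    return hsh.hexdigest()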
(The interactive example further up uses a Redis .rdb dump, which is a binary file.)
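If you want to convince yourself that block-wise hashing works on arbitrary binary content, a quick check along these lines may help. This is only a sketch: hash_file_iter is a hypothetical variant written for this example (it uses iter() with a sentinel in place of the while loop), and the temporary file filled with random bytes stands in for your .tgz file. It sticks to features available in Python 3.7.

```python
import hashlib
import os
import tempfile
from functools import partial


def hash_file_iter(filename, blocksize=4096):
    """Return the MD5 hex digest of a file, read in fixed-size blocks."""
    hsh = hashlib.md5()
    with open(filename, "rb") as f:
        # iter() keeps calling f.read(blocksize) until it returns b"" (EOF).
        for buf in iter(partial(f.read, blocksize), b""):
            hsh.update(buf)
    return hsh.hexdigest()


# Write random binary data to a throwaway file (a stand-in for a real archive).
payload = os.urandom(10000)
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    tmp_path = tmp.name

try:
    digest = hash_file_iter(tmp_path)
finally:
    os.remove(tmp_path)

# The block-wise digest should match hashing the whole payload at once.
expected = hashlib.md5(payload).hexdigest()
print(digest == expected)
```

Reading in blocks keeps memory use bounded, which matters for large archives where reading the whole file at once would be wasteful.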