I need to get a base64-encoded MD5 hash of an object, where the object is an image stored as a file, fname.
I've tried this:
def get_md5(fname):
    hash = hashlib.md5()
    with open(fname) as f:
        for chunk in iter(lambda: f.read(4096), ""):
            hash.update(chunk)
    return hash.hexdigest().encode('base64').strip()
However, I don't think this is right because it returns a string with too many characters. My understanding is that it needs to be 24 characters long. I get
NjJiM2RlOWMzOTYxYmM3MDI5Y2Q1NzdjOTQ5YWRlYTQ=
I've tried a few other similar ways as well, for example, one that does not do the chunk loop thing. They all return the same string.
(My later actions that need the base64-encoded MD5 hash fail, and I'm thinking this could be why.)
An MD5 value is always 22 (useful) characters long in Base64 notation. Many Base64 algorithms will also append 2 characters of padding when encoding an MD5 hash, bringing the total to 24 characters. The padding adds no useful information and can be discarded.
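The 22 + 2 split is easy to check. Here is a minimal sketch in Python 3; the sample input is arbitrary and only there for illustration:

```python
import base64
import hashlib

# Arbitrary sample input; any bytes give a 16-byte MD5 digest.
digest = hashlib.md5(b"this is a simple test").digest()

encoded = base64.b64encode(digest).decode("ascii")
unpadded = encoded.rstrip("=")

print(len(digest))    # 16 bytes = 128 bits
print(len(encoded))   # 24 characters, including two '=' padding characters
print(len(unpadded))  # 22 characters carry the actual data
```

Whether the padding may be dropped depends on the consumer; some services expect the full 24-character form.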
I was able to make it work by using digest() instead of hexdigest(). Then the last line becomes:
return hash.digest().encode('base64').strip()
The result was then 24 characters long, and it was accepted by Google Cloud Storage transfer, which required a base64-encoded MD5 hash.
For Python 3 (from the comment below):
import base64
return base64.b64encode(hash.digest()).decode()
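Putting it together for Python 3, the whole function might look roughly like the sketch below. The name get_md5_base64 and the chunk_size parameter are my own choices, not anything from the question. Note that the file has to be opened in binary mode and the sentinel for iter() has to be b"" rather than "":

```python
import base64
import hashlib


def get_md5_base64(fname, chunk_size=4096):
    """Return the base64-encoded MD5 digest of the file at fname."""
    md5 = hashlib.md5()
    # Binary mode: hash the raw bytes, with no decoding or newline translation.
    with open(fname, "rb") as f:
        # In binary mode read() returns bytes, so the end-of-file sentinel is b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    # digest() is the raw 16-byte value; b64encode turns it into 24 characters.
    return base64.b64encode(md5.digest()).decode("ascii")
```

Calling get_md5_base64("image.png") (a hypothetical file name) should then give a 24-character string ending in "==".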
First, base64 encoding makes strings longer. (Example using IPython with Python 3):
In [1]: s = '123456789012345678901234'
In [2]: len(s)
Out[2]: 24
In [3]: import base64
In [4]: e = base64.b64encode(s.encode('utf8'))
In [5]: len(e)
Out[5]: 32
In [6]: e
Out[6]: b'MTIzNDU2Nzg5MDEyMzQ1Njc4OTAxMjM0'
With base64 encoding you get 8 bits of output for every 6 bits of input.
In [7]: 32/24
Out[7]: 1.3333333333333333
In [8]: 8/6
Out[8]: 1.3333333333333333
The base64 alphabet uses 64 (or 2**6) different symbols.
Generally these include the lowercase and uppercase letters and the digits 0-9, which gives 62 symbols. This leaves two more required symbols and a padding character. Often + and / are used for the two extra symbols, but there are variations, especially since / is not allowed in UNIX or MS-Windows filenames.
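As an illustration of one such variation, Python's standard library offers a URL- and filename-safe alphabet that swaps + and / for - and _. A small sketch, with input bytes picked so that both special symbols show up:

```python
import base64

# Two bytes chosen so the standard encoding contains both '+' and '/'.
data = bytes([0xFB, 0xFF])

print(base64.b64encode(data))          # b'+/8='
print(base64.urlsafe_b64encode(data))  # b'-_8='
```

Which alphabet you need depends on what the receiving side expects, so check its documentation before switching.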
Second, using a hexadecimal representation doubles the length of a byte string; the hex representation of one byte can vary between 00 and FF. Example (again using IPython and Python 3):
In [1]: import hashlib
In [2]: s = b'this is a simple test'
In [3]: len(hashlib.md5(s).digest())
Out[3]: 16
In [4]: len(hashlib.md5(s).hexdigest())
Out[4]: 32
If you are going to use base64 encoding anyway, it makes no sense to use hexdigest().
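To see the difference side by side, here is a short sketch (Python 3, same arbitrary test input as above) that base64-encodes both forms of the hash:

```python
import base64
import hashlib

md5 = hashlib.md5(b"this is a simple test")

# Encoding the raw 16-byte digest.
from_digest = base64.b64encode(md5.digest())
# Encoding the 32-character hex string, as in the question.
from_hex = base64.b64encode(md5.hexdigest().encode("ascii"))

print(len(from_digest))  # 24
print(len(from_hex))     # 44
```

The 44-character result matches the length of the string shown in the question, which is consistent with the hex text being encoded instead of the raw digest.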