I found this undocumented _md5 module when I got frustrated with the slow stdlib hashlib.md5 implementation.
On a MacBook:
>>> timeit hashlib.md5(b"hello world")
597 ns ± 17.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
>>> timeit _md5.md5(b"hello world")
224 ns ± 3.18 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
>>> _md5
<module '_md5' from '/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/lib-dynload/_md5.cpython-37m-darwin.so'>
On a Windows box:
>>> timeit hashlib.md5(b"stonk overflow")
328 ns ± 21.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
>>> timeit _md5.md5(b"stonk overflow")
110 ns ± 12.5 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
>>> _md5
<module '_md5' (built-in)>
On a Linux box:
>>> timeit hashlib.md5(b"https://adventofcode.com/2016/day/5")
259 ns ± 1.33 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
>>> timeit _md5.md5(b"https://adventofcode.com/2016/day/5")
102 ns ± 0.0576 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
>>> _md5
<module '_md5' from '/usr/local/lib/python3.8/lib-dynload/_md5.cpython-38-x86_64-linux-gnu.so'>
For hashing short messages, it's way faster. For long messages, the performance is similar.
Why is it hidden away in an underscore extension module, and why isn't this faster implementation used by default in hashlib? What is the _md5 module, and why doesn't it have a public API?
Until Python 2.5, hashes and digests were implemented in their own modules (e.g. [Python 2.Docs]: md5 - MD5 message digest algorithm).
Starting with v2.5, [Python 2.6.Docs]: hashlib - Secure hashes and message digests was added. Its purpose was to offer a unified way of working with the hash algorithms and to use the (faster) OpenSSL implementations when available.
As a side effect of the latter, the Python implementations were hidden from the public API (renamed to _md5, _sha1, _sha256, _sha512, with _blake2 and _sha3 added later), as redundancy often creates confusion.
But another side effect was _hashlib.so's dependency on OpenSSL's libcrypto*.so. This is Nix (at least Linux) specific; on Win, a static libeay32.lib was linked into _hashlib.pyd and also into _ssl.pyd (which I consider lame), until v3.7+, where the OpenSSL .dlls became part of the Python installation.
Probably on 90+% of machines things go smoothly, as OpenSSL was / is installed by default, but on those where it isn't, many things can break. That is because hashlib is imported by many modules (one example is random, which itself gets imported by lots of others), so trivial pieces of code that have nothing to do with cryptography (at least not at first sight) would stop working. That's why the old implementations are kept; but only as fallbacks, since the OpenSSL versions are / should be better maintained.
[cfati@cfati-ubtu16x64-0:~/Work/Dev/StackOverflow/q059955854]> ~/sopr.sh
*** Set shorter prompt to better fit when pasted in StackOverflow (or other) pages ***

[064bit-prompt]> python3 -c "import sys, hashlib as hl, _md5, ssl;print(\"{0:}\n{1:}\n{2:}\n{3:}\".format(sys.version, _md5, hl._hashlib, ssl.OPENSSL_VERSION))"
3.5.2 (default, Oct 8 2019, 13:06:37)
[GCC 5.4.0 20160609]
<module '_md5' (built-in)>
<module '_hashlib' from '/usr/lib/python3.5/lib-dynload/_hashlib.cpython-35m-x86_64-linux-gnu.so'>
OpenSSL 1.0.2g 1 Mar 2016
[064bit-prompt]>
[064bit-prompt]> ldd /usr/lib/python3.5/lib-dynload/_hashlib.cpython-35m-x86_64-linux-gnu.so
        linux-vdso.so.1 =>  (0x00007fffa7d0b000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f50d9e4d000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f50d9a83000)
        libcrypto.so.1.0.0 => /lib/x86_64-linux-gnu/libcrypto.so.1.0.0 (0x00007f50d963e000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f50da271000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f50d943a000)
[064bit-prompt]>
[064bit-prompt]> openssl version -a
OpenSSL 1.0.2g 1 Mar 2016
built on: reproducible build, date unspecified
platform: debian-amd64
options:  bn(64,64) rc4(16x,int) des(idx,cisc,16,int) blowfish(idx)
compiler: cc -I. -I.. -I../include -fPIC -DOPENSSL_PIC -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -m64 -DL_ENDIAN -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -Wl,-Bsymbolic-functions -Wl,-z,relro -Wa,--noexecstack -Wall -DMD32_REG_T=int -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DMD5_ASM -DAES_ASM -DVPAES_ASM -DBSAES_ASM -DWHIRLPOOL_ASM -DGHASH_ASM -DECP_NISTZ256_ASM
OPENSSLDIR: "/usr/lib/ssl"
[064bit-prompt]>
[064bit-prompt]> python3 -c "import _md5, hashlib as hl;print(_md5.md5(b\"A\").hexdigest(), hl.md5(b\"A\").hexdigest())"
7fc56270e7a70fa81a5935b72eacbe29 7fc56270e7a70fa81a5935b72eacbe29
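To illustrate the transitive import point (a quick check of my own, not something from the docs): simply importing random is enough to pull hashlib in:

import sys

import random  # random does "from hashlib import sha512 as _sha512" internally

# hashlib got loaded even though this code never imports it directly
print("hashlib" in sys.modules)  # True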
According to [Python 3.Docs]: hashlib.algorithms_guaranteed:
A set containing the names of the hash algorithms guaranteed to be supported by this module on all platforms. Note that ‘md5’ is in this list despite some upstream vendors offering an odd “FIPS compliant” Python build that excludes it.
Below is an example from a custom Python 2.7 installation that I built quite a while ago (worth mentioning that it dynamically links to the OpenSSL .dlls):
e:\Work\Dev\StackOverflow\q059955854>sopr.bat
*** Set shorter prompt to better fit when pasted in StackOverflow (or other) pages ***

[prompt]> "F:\Install\pc064\HPE\OPSWpython\2.7.10__00\python.exe" -c "import sys, ssl;print(\"{0:}\n{1:}\".format(sys.version, ssl.OPENSSL_VERSION))"
2.7.10 (default, Mar 8 2016, 15:02:46) [MSC v.1600 64 bit (AMD64)]
OpenSSL 1.0.2j-fips 26 Sep 2016

[prompt]> "F:\Install\pc064\HPE\OPSWpython\2.7.10__00\python.exe" -c "import hashlib as hl;print(hl.md5(\"A\").hexdigest())"
7fc56270e7a70fa81a5935b72eacbe29

[prompt]> "F:\Install\pc064\HPE\OPSWpython\2.7.10__00\python.exe" -c "import ssl;ssl.FIPS_mode_set(True);import hashlib as hl;print(hl.md5(\"A\").hexdigest())"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ValueError: error:060A80A3:digital envelope routines:FIPS_DIGESTINIT:disabled for fips
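If code has to survive such FIPS-restricted builds, something along these lines (just a sketch of mine, not an official recipe; Python 3.9+ also accepts a usedforsecurity=False keyword) keeps it from blowing up:

import hashlib

def md5_or_none(data=b""):
    # On FIPS-restricted builds hashlib.new("md5") raises ValueError,
    # so report MD5 as unavailable instead of crashing.
    try:
        return hashlib.new("md5", data)
    except ValueError:
        return None

h = md5_or_none(b"A")
print(h.hexdigest() if h else "MD5 not available on this build")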
As for the speed question, I can only speculate: the builtin implementation has less per-call overhead than the OpenSSL-backed one, so it would make sense for it to be (slightly) faster, at least for small messages, where the overhead (function call and other underlying Python operations) takes a significant percentage of the total time compared to the hashing itself. There are also other factors to consider (e.g. whether OpenSSL assembler speedups were used).
Below are some benchmarks of my own.
code00.py:
#!/usr/bin/env python

import sys
from hashlib import md5 as md5_openssl
from _md5 import md5 as md5_builtin
import timeit


def main(*argv):
    base_text = b"A"
    number = 1000000
    print("timeit attempts number: {0:d}".format(number))
    #x = []
    #y = {}
    for count in range(0, 16):
        factor = 2 ** count
        text = base_text * factor
        globals_dict = {"text": text}
        #x.append(factor)
        print("\nUsing a {0:8d} (2 ** {1:2d}) bytes message".format(len(text), count))
        for func in [
            md5_openssl,
            md5_builtin,
        ]:
            globals_dict["md5"] = func
            t = timeit.timeit(stmt="md5(text)", globals=globals_dict, number=number)
            print("    {0:12s} took: {1:11.6f} seconds".format(func.__name__, t))
            #y.setdefault(func.__name__, []).append(t)
    #print(x, y)


if __name__ == "__main__":
    print("Python {0:s} {1:d}bit on {2:s}\n".format(" ".join(item.strip() for item in sys.version.split("\n")), 64 if sys.maxsize > 0x100000000 else 32, sys.platform))
    main(*sys.argv[1:])
    print("\nDone.")
Output:
Win 10 pc064 (running on a Dell Precision 5510 laptop):
[prompt]> "e:\Work\Dev\VEnvs\py_pc064_03.07.06_test0\Scripts\python.exe" code00.py
Python 3.7.6 (tags/v3.7.6:43364a7ae0, Dec 19 2019, 00:42:30) [MSC v.1916 64 bit (AMD64)] 64bit on win32

timeit attempts number: 1000000

Using a        1 (2 **  0) bytes message
    openssl_md5  took:    0.449134 seconds
    md5          took:    0.120021 seconds

Using a        2 (2 **  1) bytes message
    openssl_md5  took:    0.460399 seconds
    md5          took:    0.118555 seconds

Using a        4 (2 **  2) bytes message
    openssl_md5  took:    0.451850 seconds
    md5          took:    0.121166 seconds

Using a        8 (2 **  3) bytes message
    openssl_md5  took:    0.438398 seconds
    md5          took:    0.118127 seconds

Using a       16 (2 **  4) bytes message
    openssl_md5  took:    0.454653 seconds
    md5          took:    0.122818 seconds

Using a       32 (2 **  5) bytes message
    openssl_md5  took:    0.450776 seconds
    md5          took:    0.118594 seconds

Using a       64 (2 **  6) bytes message
    openssl_md5  took:    0.555761 seconds
    md5          took:    0.278812 seconds

Using a      128 (2 **  7) bytes message
    openssl_md5  took:    0.681296 seconds
    md5          took:    0.455921 seconds

Using a      256 (2 **  8) bytes message
    openssl_md5  took:    0.895952 seconds
    md5          took:    0.807457 seconds

Using a      512 (2 **  9) bytes message
    openssl_md5  took:    1.401584 seconds
    md5          took:    1.499279 seconds

Using a     1024 (2 ** 10) bytes message
    openssl_md5  took:    2.360966 seconds
    md5          took:    2.878650 seconds

Using a     2048 (2 ** 11) bytes message
    openssl_md5  took:    4.383245 seconds
    md5          took:    5.655477 seconds

Using a     4096 (2 ** 12) bytes message
    openssl_md5  took:    8.264774 seconds
    md5          took:   10.920909 seconds

Using a     8192 (2 ** 13) bytes message
    openssl_md5  took:   15.521947 seconds
    md5          took:   21.895179 seconds

Using a    16384 (2 ** 14) bytes message
    openssl_md5  took:   29.947287 seconds
    md5          took:   43.198639 seconds

Using a    32768 (2 ** 15) bytes message
    openssl_md5  took:   59.123447 seconds
    md5          took:   86.453821 seconds

Done.
Ubtu 16 pc064 (VM running in VirtualBox on the above machine):
[064bit-prompt]> python3 code00.py
Python 3.5.2 (default, Oct 8 2019, 13:06:37) [GCC 5.4.0 20160609] 64bit on linux

timeit attempts number: 1000000

Using a        1 (2 **  0) bytes message
    openssl_md5  took:    0.246166 seconds
    md5          took:    0.130589 seconds

Using a        2 (2 **  1) bytes message
    openssl_md5  took:    0.251019 seconds
    md5          took:    0.127750 seconds

Using a        4 (2 **  2) bytes message
    openssl_md5  took:    0.257018 seconds
    md5          took:    0.123116 seconds

Using a        8 (2 **  3) bytes message
    openssl_md5  took:    0.245399 seconds
    md5          took:    0.128267 seconds

Using a       16 (2 **  4) bytes message
    openssl_md5  took:    0.251832 seconds
    md5          took:    0.136373 seconds

Using a       32 (2 **  5) bytes message
    openssl_md5  took:    0.248410 seconds
    md5          took:    0.140708 seconds

Using a       64 (2 **  6) bytes message
    openssl_md5  took:    0.361016 seconds
    md5          took:    0.267021 seconds

Using a      128 (2 **  7) bytes message
    openssl_md5  took:    0.478735 seconds
    md5          took:    0.413986 seconds

Using a      256 (2 **  8) bytes message
    openssl_md5  took:    0.707602 seconds
    md5          took:    0.695042 seconds

Using a      512 (2 **  9) bytes message
    openssl_md5  took:    1.216832 seconds
    md5          took:    1.268570 seconds

Using a     1024 (2 ** 10) bytes message
    openssl_md5  took:    2.122014 seconds
    md5          took:    2.429623 seconds

Using a     2048 (2 ** 11) bytes message
    openssl_md5  took:    4.158188 seconds
    md5          took:    4.847686 seconds

Using a     4096 (2 ** 12) bytes message
    openssl_md5  took:    7.839173 seconds
    md5          took:    9.242224 seconds

Using a     8192 (2 ** 13) bytes message
    openssl_md5  took:   15.282232 seconds
    md5          took:   18.368874 seconds

Using a     16384 (2 ** 14) bytes message
    openssl_md5  took:   30.681912 seconds
    md5          took:   36.755073 seconds

Using a     32768 (2 ** 15) bytes message
    openssl_md5  took:   60.230543 seconds
    md5          took:   73.237356 seconds

Done.
The results seem to be quite different from yours. In my case, the builtin (_md5) implementation only wins for messages up to a few hundred bytes (somewhere between 256 and 512 in the tables above); for larger ones, the OpenSSL version is faster.
As a conclusion, if all your messages are small, and the builtin implementation works best for you, then use it.
Graphical representation (I had to reduce the number of timeit iterations by an order of magnitude, as it would have taken far too long for large messages):
and zooming in on the area where the 2 graphs intersect:
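If you really want to squeeze out the small-message advantage, a size-based dispatch is trivial to write. This is only a sketch of mine, and the ~512 byte threshold comes from the tables above, so it has to be re-measured on each target machine / build:

import hashlib

try:
    from _md5 import md5 as _builtin_md5  # private module, may be missing
except ImportError:
    _builtin_md5 = None

CROSSOVER_BYTES = 512  # rough crossover from the benchmarks above; machine dependent

def fast_md5(data):
    # Prefer the builtin constructor for short messages, OpenSSL for long ones.
    if _builtin_md5 is not None and len(data) < CROSSOVER_BYTES:
        return _builtin_md5(data)
    return hashlib.md5(data)

print(fast_md5(b"hello world").hexdigest())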
It is common for public Python modules to delegate their implementation to a hidden (underscore-prefixed) module. For example, the full code of the collections.abc module is:
from _collections_abc import *
from _collections_abc import __all__
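Since the public module only re-exports the hidden one, the objects are literally the same; a quick check (mine):

import collections.abc
import _collections_abc

# collections.abc re-exports the classes defined in _collections_abc
print(collections.abc.Iterable is _collections_abc.Iterable)  # True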
The functions of hashlib are dynamically created:
for __func_name in __always_supported:
    # try them all, some may not work due to the OpenSSL
    # version not supporting that algorithm.
    try:
        globals()[__func_name] = __get_hash(__func_name)
    except ValueError:
        logging = __import__('logging')
        logging.exception('code for hash %s was not found.', __func_name)
The definition of __always_supported is:
__always_supported = ('md5', 'sha1', 'sha224', 'sha256', 'sha384', 'sha512',
                      'blake2b', 'blake2s',
                      'sha3_224', 'sha3_256', 'sha3_384', 'sha3_512',
                      'shake_128', 'shake_256')
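This tuple is what gets exposed publicly as hashlib.algorithms_guaranteed; anything extra that your OpenSSL build offers only shows up in algorithms_available. A quick way to see both on a given interpreter (my snippet, not part of the hashlib source):

import hashlib

# names hashlib promises on every platform (built from __always_supported)
print(sorted(hashlib.algorithms_guaranteed))

# extra algorithms that are only reachable through this build's OpenSSL
print(sorted(hashlib.algorithms_available - hashlib.algorithms_guaranteed))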
And __get_hash is either __get_openssl_constructor or __get_builtin_constructor:
try:
    import _hashlib
    new = __hash_new
    __get_hash = __get_openssl_constructor
    algorithms_available = algorithms_available.union(
        _hashlib.openssl_md_meth_names)
except ImportError:
    new = __py_new
    __get_hash = __get_builtin_constructor
__get_builtin_constructor is the fallback used when the (again hidden) _hashlib module cannot provide the algorithm, as __get_openssl_constructor shows:
def __get_openssl_constructor(name):
    if name in __block_openssl_constructor:
        # Prefer our blake2 and sha3 implementation.
        return __get_builtin_constructor(name)
    try:
        f = getattr(_hashlib, 'openssl_' + name)
        # Allow the C module to raise ValueError. The function will be
        # defined but the hash not actually available thanks to OpenSSL.
        f()
        # Use the C function directly (very fast)
        return f
    except (AttributeError, ValueError):
        return __get_builtin_constructor(name)
Higher up in the hashlib code, you have this:
def __get_builtin_constructor(name):
    cache = __builtin_constructor_cache
    ...
    elif name in {'MD5', 'md5'}:
        import _md5
        cache['MD5'] = cache['md5'] = _md5.md5
But md5 is not in __block_openssl_constructor, hence the _hashlib / OpenSSL version is preferred to the _md5 / builtin version. Confirmation in the REPL:
>>> hashlib.md5
<built-in function openssl_md5>
>>> _md5.md5
<built-in function md5>
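As an aside, the private __get_builtin_constructor helper is reachable from outside the module, so you can get at the builtin constructor without importing _md5 yourself. It is undocumented and may change or disappear in any release, so treat this purely as an illustration:

import hashlib

# undocumented, private helper: may change between Python versions
builtin_md5 = hashlib.__get_builtin_constructor("md5")

print(builtin_md5)                    # <built-in function md5>
print(builtin_md5(b"A").hexdigest())  # same digest as hashlib.md5(b"A")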
Those functions are different implementations of the MD5 algorithm; openssl_md5 calls into a dynamic system library, which is why you see the performance difference. The first is defined in https://github.com/python/cpython/blob/master/Modules/_hashopenssl.c and the other in https://github.com/python/cpython/blob/master/Modules/md5module.c, if you want to check the differences.
Then why is the _md5.md5 function defined but never used? I guess the idea is to ensure that some algorithms are always available, even if OpenSSL is absent:
Constructors for hash algorithms that are always present in this module are sha1(), sha224(), sha256(), sha384(), sha512(), blake2b(), and blake2s(). (https://docs.python.org/3/library/hashlib.html)
My theory from looking around on bugs.python.org and reading the cpython git commit history:
cpython switched to openssl md5 in 2005 because it was faster than the built-in implementation. They added a new built-in implementation in 2007 that is faster than openssl but never switched back. Both of these changes were done by Gregory P. Smith.
Here's my evidence.
The _md5 implementation appears to be essentially the same in Python 3.8 (I'm looking at commit ea316fd21527).
I think the cpython maintainers would probably be open to switching back to _md5 when it is available, since it's no longer true that the openssl implementation is faster (and may have been untrue for the last 13 years).