Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What is _md5.md5 and why is hashlib.md5 so much slower?

Found this undocumented _md5 when getting frustrated with the slow stdlib hashlib.md5 implementation.

On a macbook:

>>> timeit hashlib.md5(b"hello world")
597 ns ± 17.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
>>> timeit _md5.md5(b"hello world")
224 ns ± 3.18 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
>>> _md5
<module '_md5' from '/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/lib-dynload/_md5.cpython-37m-darwin.so'>

On a Windows box:

>>> timeit hashlib.md5(b"stonk overflow")
328 ns ± 21.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
>>> timeit _md5.md5(b"stonk overflow")
110 ns ± 12.5 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
>>> _md5
<module '_md5' (built-in)>

On a Linux box:

>>> timeit hashlib.md5(b"https://adventofcode.com/2016/day/5")
259 ns ± 1.33 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
>>> timeit _md5.md5(b"https://adventofcode.com/2016/day/5")
102 ns ± 0.0576 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
>>> _md5
<module '_md5' from '/usr/local/lib/python3.8/lib-dynload/_md5.cpython-38-x86_64-linux-gnu.so'>

For hashing short messages, it's way faster. For long messages, similar performance.

Why is it hidden away in an underscore extension module, and why isn't this faster implementation used by default in hashlib? What is the _md5 module and why doesn't it have public API?

like image 496
wim Avatar asked Jan 28 '20 19:01

wim


3 Answers

Til Python 2.5, the hashes and digests were implemented in their own modules (e.g. [Python 2.Docs]: md5 - MD5 message digest algorithm).
Starting with v2.5, [Python 2.6.Docs]: hashlib - Secure hashes and message digests was added. Its purpose was to:

  1. Offer an unified access method to the hashes / digests (via their name)
  2. Switch (by default) to an external cryptography provider (it seems the logical step to delegate to some entity specialized in that field, as maintaining all those algorithms could be an overkill). At that time OpenSSL was the best choice: mature enough, known and compatible (there were a bunch of similar Java providers, but those were pretty useless)

As a side effect of #2., the Python implementations were hidden from the public API (renamed them: _md5, _sha1, _sha256, _sha512, and the latter ones added: _blake2, _sha3), as redundancy often creates confusions.
But, another side effect was _hashlib.so dependency on OpenSSL's libcrypto*.so (this is Nix (at least Lnx) specific, on Win, a static libeay32.lib was linked in _hashlib.pyd, and also _ssl.pyd (which I consider lame), till v3.7+, where OpenSSL .dlls are part of the Python installation).
Probably on 90+% of the machines things were smooth, as OpenSSL was / is installed by default, but for those where it isn't, many things might get broken because for example hashlib is imported by many modules (one such example is random which itself gets imported by lots of others), so trivial pieces of code that are not related at all to cryptography (at least not at 1st sight) will stop working. That's why the old implementations are kept (but again, they are only fallbacks as OpenSSL versions are / should be better maintained).

[cfati@cfati-ubtu16x64-0:~/Work/Dev/StackOverflow/q059955854]> ~/sopr.sh
*** Set shorter prompt to better fit when pasted in StackOverflow (or other) pages ***

[064bit-prompt]> python3 -c "import sys, hashlib as hl, _md5, ssl;print(\"{0:}\n{1:}\n{2:}\n{3:}\".format(sys.version, _md5, hl._hashlib, ssl.OPENSSL_VERSION))"
3.5.2 (default, Oct  8 2019, 13:06:37)
[GCC 5.4.0 20160609]
<module '_md5' (built-in)>
<module '_hashlib' from '/usr/lib/python3.5/lib-dynload/_hashlib.cpython-35m-x86_64-linux-gnu.so'>
OpenSSL 1.0.2g  1 Mar 2016
[064bit-prompt]>
[064bit-prompt]> ldd /usr/lib/python3.5/lib-dynload/_hashlib.cpython-35m-x86_64-linux-gnu.so
        linux-vdso.so.1 =>  (0x00007fffa7d0b000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f50d9e4d000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f50d9a83000)
        libcrypto.so.1.0.0 => /lib/x86_64-linux-gnu/libcrypto.so.1.0.0 (0x00007f50d963e000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f50da271000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f50d943a000)
[064bit-prompt]>
[064bit-prompt]> openssl version -a
OpenSSL 1.0.2g  1 Mar 2016
built on: reproducible build, date unspecified
platform: debian-amd64
options:  bn(64,64) rc4(16x,int) des(idx,cisc,16,int) blowfish(idx)
compiler: cc -I. -I.. -I../include  -fPIC -DOPENSSL_PIC -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -m64 -DL_ENDIAN -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -Wl,-Bsymbolic-functions -Wl,-z,relro -Wa,--noexecstack -Wall -DMD32_REG_T=int -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DMD5_ASM -DAES_ASM -DVPAES_ASM -DBSAES_ASM -DWHIRLPOOL_ASM -DGHASH_ASM -DECP_NISTZ256_ASM
OPENSSLDIR: "/usr/lib/ssl"
[064bit-prompt]>
[064bit-prompt]> python3 -c "import _md5, hashlib as hl;print(_md5.md5(b\"A\").hexdigest(), hl.md5(b\"A\").hexdigest())"
7fc56270e7a70fa81a5935b72eacbe29 7fc56270e7a70fa81a5935b72eacbe29

According to [Python 3.Docs]: hashlib.algorithms_guaranteed:

A set containing the names of the hash algorithms guaranteed to be supported by this module on all platforms. Note that ‘md5’ is in this list despite some upstream vendors offering an odd “FIPS compliant” Python build that excludes it.

Below it's an example of a custom Python 2.7 installation (that I built quite a while ago, worth mentioning that it dynamically links to OpenSSL .dlls):

e:\Work\Dev\StackOverflow\q059955854>sopr.bat
*** Set shorter prompt to better fit when pasted in StackOverflow (or other) pages ***

[prompt]> "F:\Install\pc064\HPE\OPSWpython\2.7.10__00\python.exe" -c "import sys, ssl;print(\"{0:}\n{1:}\".format(sys.version, ssl.OPENSSL_VERSION))"
2.7.10 (default, Mar  8 2016, 15:02:46) [MSC v.1600 64 bit (AMD64)]
OpenSSL 1.0.2j-fips  26 Sep 2016

[prompt]> "F:\Install\pc064\HPE\OPSWpython\2.7.10__00\python.exe" -c "import hashlib as hl;print(hl.md5(\"A\").hexdigest())"
7fc56270e7a70fa81a5935b72eacbe29

[prompt]> "F:\Install\pc064\HPE\OPSWpython\2.7.10__00\python.exe" -c "import ssl;ssl.FIPS_mode_set(True);import hashlib as hl;print(hl.md5(\"A\").hexdigest())"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ValueError: error:060A80A3:digital envelope routines:FIPS_DIGESTINIT:disabled for fips

As for the speed question I can only speculate:

  • Python implementation was (obviously) written specifically for Python, meaning it is "more optimized" (yes, this is grammatically incorrect) for Python than a generic version, and also resides in python*.so (or the python executable itself)
  • OpenSSL implementation resides in libcrypto*.so, and it's being accessed by the wrapper _hashlib.so, which does the back and forth conversions between Python types (PyObject*) and the OpenSSL ones (EVP_MD_CTX*)

Considering the above, it would make sense that the former is (slightly) faster (at least for small messages, where the overhead (function call and other Python underlying operations) takes a significant percentage of the total time compared to the hashing itself). There are also other factors to be considered (e.g. whether OpenSSL assembler speedups were used).



Update #0

Below are some benchmarks of my own.

code00.py:

#!/usr/bin/env python

import sys
from hashlib import md5 as md5_openssl
from _md5 import md5 as md5_builtin
import timeit


def main(*argv):
    base_text = b"A"
    number = 1000000
    print("timeit attempts number: {0:d}".format(number))
    #x = []
    #y = {}
    for count in range(0, 16):
        factor = 2 ** count
        text = base_text * factor
        globals_dict = {"text": text}
        #x.append(factor)
        print("\nUsing a {0:8d} (2 ** {1:2d}) bytes message".format(len(text), count))
        for func in [
            md5_openssl,
            md5_builtin,
        ]:
            globals_dict["md5"] = func

            t = timeit.timeit(stmt="md5(text)", globals=globals_dict, number=number)
            print("    {0:12s} took: {1:11.6f} seconds".format(func.__name__, t))
            #y.setdefault(func.__name__, []).append(t)
    #print(x, y)


if __name__ == "__main__":
    print("Python {0:s} {1:d}bit on {2:s}\n".format(" ".join(item.strip() for item in sys.version.split("\n")), 64 if sys.maxsize > 0x100000000 else 32, sys.platform))
    main(*sys.argv[1:])
    print("\nDone.")

Output:

  • Win 10 pc064 (running on a Dell Precision 5510 laptop):

    [prompt]> "e:\Work\Dev\VEnvs\py_pc064_03.07.06_test0\Scripts\python.exe" code00.py
    Python 3.7.6 (tags/v3.7.6:43364a7ae0, Dec 19 2019, 00:42:30) [MSC v.1916 64 bit (AMD64)] 64bit on win32
    
    timeit attempts number: 1000000
    
    Using a        1 (2 **  0) bytes message
        openssl_md5  took:    0.449134 seconds
        md5          took:    0.120021 seconds
    
    Using a        2 (2 **  1) bytes message
        openssl_md5  took:    0.460399 seconds
        md5          took:    0.118555 seconds
    
    Using a        4 (2 **  2) bytes message
        openssl_md5  took:    0.451850 seconds
        md5          took:    0.121166 seconds
    
    Using a        8 (2 **  3) bytes message
        openssl_md5  took:    0.438398 seconds
        md5          took:    0.118127 seconds
    
    Using a       16 (2 **  4) bytes message
        openssl_md5  took:    0.454653 seconds
        md5          took:    0.122818 seconds
    
    Using a       32 (2 **  5) bytes message
        openssl_md5  took:    0.450776 seconds
        md5          took:    0.118594 seconds
    
    Using a       64 (2 **  6) bytes message
        openssl_md5  took:    0.555761 seconds
        md5          took:    0.278812 seconds
    
    Using a      128 (2 **  7) bytes message
        openssl_md5  took:    0.681296 seconds
        md5          took:    0.455921 seconds
    
    Using a      256 (2 **  8) bytes message
        openssl_md5  took:    0.895952 seconds
        md5          took:    0.807457 seconds
    
    Using a      512 (2 **  9) bytes message
        openssl_md5  took:    1.401584 seconds
        md5          took:    1.499279 seconds
    
    Using a     1024 (2 ** 10) bytes message
        openssl_md5  took:    2.360966 seconds
        md5          took:    2.878650 seconds
    
    Using a     2048 (2 ** 11) bytes message
        openssl_md5  took:    4.383245 seconds
        md5          took:    5.655477 seconds
    
    Using a     4096 (2 ** 12) bytes message
        openssl_md5  took:    8.264774 seconds
        md5          took:   10.920909 seconds
    
    Using a     8192 (2 ** 13) bytes message
        openssl_md5  took:   15.521947 seconds
        md5          took:   21.895179 seconds
    
    Using a    16384 (2 ** 14) bytes message
        openssl_md5  took:   29.947287 seconds
        md5          took:   43.198639 seconds
    
    Using a    32768 (2 ** 15) bytes message
        openssl_md5  took:   59.123447 seconds
        md5          took:   86.453821 seconds
    
    Done.
    
  • Ubtu 16 pc064 (VM running in VirtualBox on the above machine):

    [064bit-prompt]> python3 code00.py
    Python 3.5.2 (default, Oct  8 2019, 13:06:37) [GCC 5.4.0 20160609] 64bit on linux
    
    timeit attempts number: 1000000
    
    Using a        1 (2 **  0) bytes message
        openssl_md5  took:    0.246166 seconds
        md5          took:    0.130589 seconds
    
    Using a        2 (2 **  1) bytes message
        openssl_md5  took:    0.251019 seconds
        md5          took:    0.127750 seconds
    
    Using a        4 (2 **  2) bytes message
        openssl_md5  took:    0.257018 seconds
        md5          took:    0.123116 seconds
    
    Using a        8 (2 **  3) bytes message
        openssl_md5  took:    0.245399 seconds
        md5          took:    0.128267 seconds
    
    Using a       16 (2 **  4) bytes message
        openssl_md5  took:    0.251832 seconds
        md5          took:    0.136373 seconds
    
    Using a       32 (2 **  5) bytes message
        openssl_md5  took:    0.248410 seconds
        md5          took:    0.140708 seconds
    
    Using a       64 (2 **  6) bytes message
        openssl_md5  took:    0.361016 seconds
        md5          took:    0.267021 seconds
    
    Using a      128 (2 **  7) bytes message
        openssl_md5  took:    0.478735 seconds
        md5          took:    0.413986 seconds
    
    Using a      256 (2 **  8) bytes message
        openssl_md5  took:    0.707602 seconds
        md5          took:    0.695042 seconds
    
    Using a      512 (2 **  9) bytes message
        openssl_md5  took:    1.216832 seconds
        md5          took:    1.268570 seconds
    
    Using a     1024 (2 ** 10) bytes message
        openssl_md5  took:    2.122014 seconds
        md5          took:    2.429623 seconds
    
    Using a     2048 (2 ** 11) bytes message
        openssl_md5  took:    4.158188 seconds
        md5          took:    4.847686 seconds
    
    Using a     4096 (2 ** 12) bytes message
        openssl_md5  took:    7.839173 seconds
        md5          took:    9.242224 seconds
    
    Using a     8192 (2 ** 13) bytes message
        openssl_md5  took:   15.282232 seconds
        md5          took:   18.368874 seconds
    
    Using a    16384 (2 ** 14) bytes message
        openssl_md5  took:   30.681912 seconds
        md5          took:   36.755073 seconds
    
    Using a    32768 (2 ** 15) bytes message
        openssl_md5  took:   60.230543 seconds
        md5          took:   73.237356 seconds
    
    Done.
    

The result seem to be quite different than yours. In my case:

  • Starting somewhere in [~512B .. ~1KiB] sized messages, OpenSSL implementation seems to perform better than builtin one
  • I know that there are too few results to claim a pattern, but it seems that both implementations seem to be linearly proportional (in terms of time) with message size (but the builtin slope seems to be a bit steeper - meaning it will perform worse on the long run)

As a conclusion, if all your messages are small, and the builtin implementation works best for you, then use it.



Update #1

Graphical representation (I had to reduce the timeit iterations number by an order of magnitude, as it would take much too long for large messages):

Img0

and zooming on the area where the 2 graphs intersect:

Img1

like image 177
CristiFati Avatar answered Oct 13 '22 02:10

CristiFati


It is common for Python public modules to delegate methods to a hidden module.

For example, the full code of the collections.abc module is:

from _collections_abc import *
from _collections_abc import __all__

The functions of hashlib are dynamically created:

for __func_name in __always_supported:
    # try them all, some may not work due to the OpenSSL
    # version not supporting that algorithm.
    try:
        globals()[__func_name] = __get_hash(__func_name)

The definition of always_supported is:

__always_supported = ('md5', 'sha1', 'sha224', 'sha256', 'sha384', 'sha512',
                      'blake2b', 'blake2s',
                      'sha3_224', 'sha3_256', 'sha3_384', 'sha3_512',
                      'shake_128', 'shake_256')

And get_hash either __get_openssl_constructor or __get_builtin_constructor:

try:
    import _hashlib
    new = __hash_new
    __get_hash = __get_openssl_constructor
    algorithms_available = algorithms_available.union(
            _hashlib.openssl_md_meth_names)
except ImportError:
    new = __py_new
    __get_hash = __get_builtin_constructor

__get_builtin_constructor is a fallback for the (again) hidden _hashlib module:

def __get_openssl_constructor(name):
    if name in __block_openssl_constructor:
        # Prefer our blake2 and sha3 implementation.
        return __get_builtin_constructor(name)
    try:
        f = getattr(_hashlib, 'openssl_' + name)
        # Allow the C module to raise ValueError.  The function will be
        # defined but the hash not actually available thanks to OpenSSL.
        f()
        # Use the C function directly (very fast)
        return f
    except (AttributeError, ValueError):
        return __get_builtin_constructor(name)

Above in the hashlib code, you have this:

def __get_builtin_constructor(name):
    cache = __builtin_constructor_cache
    ...
    elif name in {'MD5', 'md5'}:
        import _md5
        cache['MD5'] = cache['md5'] = _md5.md5

But md5 is not in __block_openssl_constructor, hence the _hashlib/openssl version is preferred to the _md5/builtin version:

Confirmation in the REPL:

>>> hashlib.md5
<built-in function openssl_md5>
>>> _md5.md5
<built-in function md5>

Those functions are different implementations of the MD5 algorithm and the openssl_md5 makes a call to a dynamic system library. That's why you have some performance changes. The first version is defined in https://github.com/python/cpython/blob/master/Modules/_hashopenssl.c and the other in https://github.com/python/cpython/blob/master/Modules/md5module.c, if you want to check the differences.

Then why is the _md5.md5 function defined but never used? I guess the idea is to ensure that some algorithms are always available, even if openssl is absent:

Constructors for hash algorithms that are always present in this module are sha1(), sha224(), sha256(), sha384(), sha512(), blake2b(), and blake2s(). (https://docs.python.org/3/library/hashlib.html)

like image 23
jferard Avatar answered Oct 13 '22 02:10

jferard


My theory from looking around on bugs.python.org and reading the cpython git commit history:

cpython switched to openssl md5 in 2005 because it was faster than the built-in implementation. They added a new built-in implementation in 2007 that is faster than openssl but never switched back. Both of these changes were done by Gregory P. Smith.

Here's my evidence.

  • In 2005, Greg created the bpo issue "sha and md5 modules should use OpenSSL when possible". This change made in this commit.
  • In 2007, Greg added the new, fast md5 module in this commit.
  • The _md5 implementation appears to be essentially the same in Python 3.8 (I'm looking at commit ea316fd21527)

I think the cpython maintainers would probably be open to switching back to _md5 when it is available, since it's no longer true that the openssl implementation is faster (and may have been untrue for the last 13 years).

like image 2
Michael Graczyk Avatar answered Oct 13 '22 03:10

Michael Graczyk