I was working on a simple class that extends `dict`, and I realized that key lookup and use of `pickle` are very slow. I thought it was a problem with my class, so I did some trivial benchmarks:
(venv) marco@buzz:~/sources/python-frozendict/test$ python --version
Python 3.9.0a0
(venv) marco@buzz:~/sources/python-frozendict/test$ sudo pyperf system tune --affinity 3
[sudo] password for marco:
Tune the system configuration to run benchmarks
Actions
=======
CPU Frequency: Minimum frequency of CPU 3 set to the maximum frequency
System state
============
CPU: use 1 logical CPUs: 3
Perf event: Maximum sample rate: 1 per second
ASLR: Full randomization
Linux scheduler: No CPU is isolated
CPU Frequency: 0-3=min=max=2600 MHz
CPU scaling governor (intel_pstate): performance
Turbo Boost (intel_pstate): Turbo Boost disabled
IRQ affinity: irqbalance service: inactive
IRQ affinity: Default IRQ affinity: CPU 0-2
IRQ affinity: IRQ affinity: IRQ 0,2=CPU 0-3; IRQ 1,3-17,51,67,120-131=CPU 0-2
Power supply: the power cable is plugged
Advices
=======
Linux scheduler: Use isolcpus=<cpu list> kernel parameter to isolate CPUs
Linux scheduler: Use rcu_nocbs=<cpu list> kernel parameter (with isolcpus) to not schedule RCU on isolated CPUs
(venv) marco@buzz:~/sources/python-frozendict/test$ python -m pyperf timeit --rigorous --affinity 3 -s '
x = {0:0, 1:1, 2:2, 3:3, 4:4}
' 'x[4]'
.........................................
Mean +- std dev: 35.2 ns +- 1.8 ns
(venv) marco@buzz:~/sources/python-frozendict/test$ python -m pyperf timeit --rigorous --affinity 3 -s '
class A(dict):
pass
x = A({0:0, 1:1, 2:2, 3:3, 4:4})
' 'x[4]'
.........................................
Mean +- std dev: 60.1 ns +- 2.5 ns
(venv) marco@buzz:~/sources/python-frozendict/test$ python -m pyperf timeit --rigorous --affinity 3 -s '
x = {0:0, 1:1, 2:2, 3:3, 4:4}
' '5 in x'
.........................................
Mean +- std dev: 31.9 ns +- 1.4 ns
(venv) marco@buzz:~/sources/python-frozendict/test$ python -m pyperf timeit --rigorous --affinity 3 -s '
class A(dict):
pass
x = A({0:0, 1:1, 2:2, 3:3, 4:4})
' '5 in x'
.........................................
Mean +- std dev: 64.7 ns +- 5.4 ns
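The same gap can be reproduced with plain `timeit` from the standard library (a minimal sketch, not part of the original session; the exact ratio depends on the interpreter build, and newer CPython versions may narrow it):

```python
from timeit import timeit

class A(dict):
    pass

setup = {0: 0, 1: 1, 2: 2, 3: 3, 4: 4}

# Time key lookup on a plain dict and on a trivial dict subclass.
t_dict = timeit("x[4]", globals={"x": dict(setup)}, number=1_000_000)
t_sub = timeit("x[4]", globals={"x": A(setup)}, number=1_000_000)

print(f"dict:     {t_dict:.3f}s")
print(f"subclass: {t_sub:.3f}s")
```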
(venv) marco@buzz:~/sources/python-frozendict/test$ python
Python 3.9.0a0 (heads/master-dirty:d8ca2354ed, Oct 30 2019, 20:25:01)
[GCC 9.2.1 20190909] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from timeit import timeit
>>> class A(dict):
... def __reduce__(self):
... return (A, (dict(self), ))
...
>>> timeit("dumps(x)", """
... from pickle import dumps
... x = {0:0, 1:1, 2:2, 3:3, 4:4}
... """, number=10000000)
6.70694484282285
>>> timeit("dumps(x)", """
... from pickle import dumps
... x = A({0:0, 1:1, 2:2, 3:3, 4:4})
... """, number=10000000, globals={"A": A})
31.277778962627053
>>> timeit("loads(x)", """
... from pickle import dumps, loads
... x = dumps({0:0, 1:1, 2:2, 3:3, 4:4})
... """, number=10000000)
5.767975459806621
>>> timeit("loads(x)", """
... from pickle import dumps, loads
... x = dumps(A({0:0, 1:1, 2:2, 3:3, 4:4}))
... """, number=10000000, globals={"A": A})
22.611666693352163
The results are really a surprise. While key lookup is 2x slower, `pickle` is 5x slower. How can this be? Other methods, like `get()`, `__eq__()` and `__init__()`, and iteration over `keys()`, `values()` and `items()`, are as fast as in `dict`.
EDIT: I took a look at the source code of Python 3.9, and in `Objects/dictobject.c` it seems that the `__getitem__()` method is implemented by `dict_subscript()`. And `dict_subscript()` slows down subclasses only if the key is missing, since the subclass can implement `__missing__()` and it tries to see if it exists. But the benchmark was with an existing key.

But I noticed something: `__getitem__()` is defined with the flag `METH_COEXIST`. And `__contains__()`, the other method that is 2x slower, has the same flag. From the official documentation:
The method will be loaded in place of existing definitions. Without METH_COEXIST, the default is to skip repeated definitions. Since slot wrappers are loaded before the method table, the existence of a sq_contains slot, for example, would generate a wrapped method named contains() and preclude the loading of a corresponding PyCFunction with the same name. With the flag defined, the PyCFunction will be loaded in place of the wrapper object and will co-exist with the slot. This is helpful because calls to PyCFunctions are optimized more than wrapper object calls.
So if I understood correctly, in theory `METH_COEXIST` should speed things up, but it seems to have the opposite effect. Why?
EDIT 2: I discovered something more. `__getitem__()` and `__contains__()` are flagged as `METH_COEXIST` because they are declared in `PyDict_Type` twice.

They are both present, once, in the slot `tp_methods`, where they are explicitly declared as `__getitem__()` and `__contains__()`. But the official documentation says that `tp_methods` are not inherited by subclasses.

So a subclass of `dict` does not call `__getitem__()`, but calls the subslot `mp_subscript`. Indeed, `mp_subscript` is contained in the slot `tp_as_mapping`, which permits a subclass to inherit its subslots.

The problem is that both `__getitem__()` and `mp_subscript` use the same function, `dict_subscript`. Is it possible that only the way it was inherited slows it down?
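The asymmetry is visible from Python, too: attribute lookup on the subclass still finds `dict`'s own `__getitem__` method object through the MRO, even though plain indexing goes through the C-level `mp_subscript` slot (a small check, not from the original post):

```python
class A(dict):
    pass

# The Python-level methods are inherited normally through the MRO...
assert A.__getitem__ is dict.__getitem__
assert A.__contains__ is dict.__contains__

# ...and behave identically for present keys; only the speed differs.
x = A({0: 0, 4: 4})
assert x[4] == 4
assert 4 in x and 5 not in x
```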
Indexing and `in` are slower in `dict` subclasses because of a bad interaction between a `dict` optimization and the logic subclasses use to inherit C slots. This should be fixable, though not from your end.

The CPython implementation has two sets of hooks for operator overloads. There are Python-level methods like `__contains__` and `__getitem__`, but there's also a separate set of slots for C function pointers in the memory layout of a type object. Usually, either the Python method will be a wrapper around the C implementation, or the C slot will contain a function that searches for and calls the Python method. It's more efficient for the C slot to implement the operation directly, as the C slot is what Python actually accesses.
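The bridging works in both directions. For instance, assigning a Python-level `__getitem__` to a class fills the corresponding C slot with a wrapper that calls the Python method, so existing instances become subscriptable at once (an illustrative sketch, not code from the answer):

```python
class M:
    pass

m = M()

# Adding the Python-level method after class creation also updates the
# C-level mp_subscript slot, so indexing starts working on m immediately.
M.__getitem__ = lambda self, key: key * 2
assert m[3] == 6

# Conversely, for built-in types the C slot holds the real implementation;
# the visible Python method is generated machinery around it.
assert (3).__add__(4) == 7  # slot wrapper around the C-level nb_add slot
```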
Mappings written in C implement the C slots `sq_contains` and `mp_subscript` to provide `in` and indexing. Ordinarily, the Python-level `__contains__` and `__getitem__` methods would be automatically generated as wrappers around the C functions, but the `dict` class has explicit implementations of `__contains__` and `__getitem__`, because the explicit implementations are a bit faster than the generated wrappers:
static PyMethodDef mapp_methods[] = {
DICT___CONTAINS___METHODDEF
{"__getitem__", (PyCFunction)(void(*)(void))dict_subscript, METH_O | METH_COEXIST,
getitem__doc__},
...
(Actually, the explicit `__getitem__` implementation is the same function as the `mp_subscript` implementation, just with a different kind of wrapper.)
Ordinarily, a subclass would inherit its parent's implementations of C-level hooks like `sq_contains` and `mp_subscript`, and the subclass would be just as fast as the superclass. However, the logic in `update_one_slot` looks for the parent implementation by trying to find the generated wrapper methods through an MRO search. `dict` doesn't have generated wrappers for `sq_contains` and `mp_subscript`, because it provides explicit `__contains__` and `__getitem__` implementations.
Instead of inheriting `sq_contains` and `mp_subscript`, `update_one_slot` ends up giving the subclass `sq_contains` and `mp_subscript` implementations that perform an MRO search for `__contains__` and `__getitem__` and call those. This is much less efficient than inheriting the C slots directly.

Fixing this will require changes to the `update_one_slot` implementation.
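That MRO-searching slot is also what makes overriding work at all: a `__getitem__` defined on the subclass is found by the search and used for plain indexing (a quick illustration, assuming nothing beyond standard CPython behavior):

```python
class B(dict):
    def __getitem__(self, key):
        # Reached via the subclass's mp_subscript slot, which searches
        # the MRO for __getitem__ instead of using dict's fast C slot.
        return ("intercepted", dict.__getitem__(self, key))

b = B({1: "one"})
assert b[1] == ("intercepted", "one")
```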
Aside from what I described above, `dict_subscript` also looks up `__missing__` for `dict` subclasses, so fixing the slot inheritance issue won't make subclasses completely on par with `dict` itself for lookup speed, but it should get them a lot closer.
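That `__missing__` hook is the documented behavior of `dict.__getitem__` for subclasses, and it only fires for absent keys:

```python
class D(dict):
    def __missing__(self, key):
        # Called by dict.__getitem__ only when the key is not present.
        return f"default:{key}"

d = D({1: "one"})
assert d[1] == "one"          # present key: normal lookup
assert d[42] == "default:42"  # absent key: routed to __missing__
assert 42 not in d            # __missing__ does not insert anything
```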
As for pickling, on the `dumps` side, the pickle implementation has a dedicated fast path for dicts, while the `dict` subclass takes a more roundabout path through `object.__reduce_ex__` and `save_reduce`.

On the `loads` side, the time difference is mostly just from the extra opcodes and lookups to retrieve and instantiate the `__main__.A` class, while dicts have a dedicated pickle opcode for making a new dict. If we compare the disassembly for the pickles:
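The extra `__reduce_ex__` machinery also shows up in the pickle itself: the subclass pickle carries a reference to the class and is therefore larger (a small check, not from the original answer; exact sizes depend on the pickle protocol):

```python
import pickle

class A(dict):
    pass

data = {0: 0, 1: 1, 2: 2, 3: 3, 4: 4}

plain = pickle.dumps(data)
sub = pickle.dumps(A(data))

# The subclass pickle embeds '__main__' and 'A' to rebuild the class.
assert len(sub) > len(plain)

# Round-tripping preserves both the contents and the subclass type.
assert pickle.loads(sub) == data
assert type(pickle.loads(sub)) is A
```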
In [26]: pickletools.dis(pickle.dumps({0: 0, 1: 1, 2: 2, 3: 3, 4: 4}))
0: \x80 PROTO 4
2: \x95 FRAME 25
11: } EMPTY_DICT
12: \x94 MEMOIZE (as 0)
13: ( MARK
14: K BININT1 0
16: K BININT1 0
18: K BININT1 1
20: K BININT1 1
22: K BININT1 2
24: K BININT1 2
26: K BININT1 3
28: K BININT1 3
30: K BININT1 4
32: K BININT1 4
34: u SETITEMS (MARK at 13)
35: . STOP
highest protocol among opcodes = 4
In [27]: pickletools.dis(pickle.dumps(A({0: 0, 1: 1, 2: 2, 3: 3, 4: 4})))
0: \x80 PROTO 4
2: \x95 FRAME 43
11: \x8c SHORT_BINUNICODE '__main__'
21: \x94 MEMOIZE (as 0)
22: \x8c SHORT_BINUNICODE 'A'
25: \x94 MEMOIZE (as 1)
26: \x93 STACK_GLOBAL
27: \x94 MEMOIZE (as 2)
28: ) EMPTY_TUPLE
29: \x81 NEWOBJ
30: \x94 MEMOIZE (as 3)
31: ( MARK
32: K BININT1 0
34: K BININT1 0
36: K BININT1 1
38: K BININT1 1
40: K BININT1 2
42: K BININT1 2
44: K BININT1 3
46: K BININT1 3
48: K BININT1 4
50: K BININT1 4
52: u SETITEMS (MARK at 31)
53: . STOP
highest protocol among opcodes = 4
we see that the difference between the two is that the second pickle needs a whole bunch of opcodes to look up `__main__.A` and instantiate it, while the first pickle just does `EMPTY_DICT` to get an empty dict. After that, both pickles push the same keys and values onto the pickle operand stack and run `SETITEMS`.
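The same comparison can be made programmatically with `pickletools.genops`, which yields the opcodes of a pickle stream (a sketch using only the stdlib, assuming pickle protocol 2 or higher so `NEWOBJ` is used):

```python
import pickle
import pickletools

class A(dict):
    pass

data = {0: 0, 1: 1, 2: 2, 3: 3, 4: 4}

ops_plain = [op.name for op, arg, pos in pickletools.genops(pickle.dumps(data))]
ops_sub = [op.name for op, arg, pos in pickletools.genops(pickle.dumps(A(data)))]

# Plain dict: one dedicated opcode creates the empty dict.
assert "EMPTY_DICT" in ops_plain

# Subclass: the class must be looked up and instantiated first.
assert "STACK_GLOBAL" in ops_sub
assert "NEWOBJ" in ops_sub

# Both then fill the mapping the same way.
assert "SETITEMS" in ops_plain and "SETITEMS" in ops_sub
```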