Python 3.7 introduced dataclasses to store data. I'm considering to move to this new approach which is more organized and well structured than a dict.
But I have a doubt. Python transforms keys into hashes on dicts and that makes looking for keys and values much faster. Dataclasses implement something like it?
Which one is faster and why?
Lookups are faster in dictionaries because Python implements them using hash tables. If we explain the difference by Big O concepts, dictionaries have constant time complexity, O(1) while lists have linear time complexity, O(n).
DataClass is slower than others while creating data objects (2.94 µs). NamedTuple is the faster one while creating data objects (2.01 µs). An object is slower than DataClass but faster than NamedTuple while creating data objects (2.34 µs).
The reason is because a dictionary is a lookup, while a list is an iteration. Dictionary uses a hash lookup, while your list requires walking through the list until it finds the result from beginning to the result each time.
Python is slow. I bet you might encounter this counterargument many times about using Python, especially from people who come from C or C++ or Java world. This is true in many cases, for instance, looping over or sorting Python arrays, lists, or dictionaries can be sometimes slow.
The reason is dictionaries are very fast, implemented using a technique called hashing, which allows us to access a value very quickly. By contrast, the list of tuples implementation is slow. If we wanted to find a value associated with a key, we would have to iterate over every tuple, checking the 0th element.
All classes in python actually use a dictionary under the hood to store their attributes, as you can read here in the documentation. For a more in-depth reference on how python classes (and many more things) work, you can also check out the article on python's datamodel, in particular the section on custom classes.
So in general, there shouldn't be a loss in performance by moving from dictionaries to dataclasses. But it's better to make sure with the timeit module:
Baseline
# dictionary creation
$ python -m timeit "{'var': 1}"
5000000 loops, best of 5: 52.9 nsec per loop
# dictionary key access
$ python -m timeit -s "d = {'var': 1}" "d['var']"
10000000 loops, best of 5: 20.3 nsec per loop
Basic dataclass
# dataclass creation
$ python -m timeit -s "from dataclasses import dataclass" -s "@dataclass" -s "class A: var: int" "A(1)"
1000000 loops, best of 5: 288 nsec per loop
# dataclass attribute access
$ python -m timeit -s "from dataclasses import dataclass" -s "@dataclass" -s "class A: var: int" -s "a = A(1)" "a.var"
10000000 loops, best of 5: 25.3 nsec per loop
Here we can see that using classes does have some overhead. For class creation it's quite a bit (~5 times slower), but you don't necessarily need to care that much about it as long as you don't plan to create and toss your dataclasses multiple times per second.
The attribute access is probably the more important metric, and while dataclasses are again slower (~1.25 times), this time it's not by that much.
If you think that's still a tad too slow, you can tune your dataclass (or any classes, really) by using slots instead of a dictionary to store their attributes:
Slotted dataclass
# dataclass creation
$ python -m timeit -s "from dataclasses import dataclass" -s "@dataclass" -s "class A: __slots__ = ('var',); var: int" "A(1)"
1000000 loops, best of 5: 242 nsec per loop
# dataclass attribute access
$ python -m timeit -s "from dataclasses import dataclass" -s "@dataclass" -s "class A: __slots__ = ('var',); var: int" -s "a = A(1)" "a.var"
10000000 loops, best of 5: 21.7 nsec per loop
By using this pattern we could shave off a few more more nanoseconds. At this point, at least regarding attribute access, there shouldn't be a noticeable difference to dictionaries any more, and you can use the upsides of dataclasses without compromising speed.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With