
Why is dictionary lookup in Python always slower when using user defined objects as keys?

I noticed that when I use user-defined objects (that override the __hash__ method) as keys of my dicts in Python, lookup time increases by at least a factor of 5.

This behaviour is observed even when I use a very basic hash method, as in the following example:

class A:
    def __init__(self, a):
        self.a = a
    def __hash__(self):
        return hash(self.a)
    def __eq__(self, other):
        if not isinstance(other, A):
            return NotImplemented
        return self.a == other.a and self.__class__ == other.__class__

# get an instance of class A
mya = A(42)
# define dict
d1 = {mya: [1, 2], 'foo': [3, 4]}

If I time the access through the two different keys, I observe a significant difference in performance:

%timeit d1['foo']

results in ~100 ns, whereas

%timeit d1[mya]

results in ~600 ns.

If I remove the overridden __hash__ and __eq__ methods, performance goes back to the same level as for a default object.

Is there a way to avoid this loss in performance and still implement a customised hash calculation?

Paolo Gervasoni Vila asked Oct 16 '25


1 Answer

The default CPython __hash__ implementation for a custom class is written in C and uses the memory address of the object. It therefore does not have to access anything from the object at all and is very fast, amounting to little more than a single integer operation on the CPU.
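
A minimal sketch to illustrate this (the exact relationship between id() and hash() is a CPython implementation detail and may differ in other interpreters or versions):

# Default hashing never touches the instance's attributes:
class Plain:
    pass

p = Plain()
print(id(p))    # memory address of the object (in CPython)
print(hash(p))  # derived from that address, no attribute access involved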

The "very basic" __hash__ from the example is not as simple as it may seem:

def __hash__(self):
    return hash(self.a)

This has to read the attribute a of self, which I'd say in this case calls object.__getattribute__(self, 'a'), and that looks for the value of 'a' in the instance's __dict__. That lookup alone involves calculating hash('a') and searching for it. Only then is the attribute's value passed to hash().
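
A rough sketch of the work hidden behind that one-liner, using the mya instance from the question (this is not the exact C-level call sequence, just the Python-level equivalent):

value = object.__getattribute__(mya, 'a')  # looks up the string 'a' in mya.__dict__,
                                           # which itself requires hashing 'a'
key_hash = hash(value)                     # only then is hash() called on the value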


To answer the additional question:

Is there a way to implement a faster __hash__ method that returns predictable values, meaning values that are not computed differently on each run, as the memory address of the object is?

Anything that accesses attributes of an object will be slower than an implementation which does not need to access attributes at all, but you could make attribute access faster by using __slots__, or by implementing a highly optimized C extension for the class.
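
For example, a sketch of the class from the question with __slots__ added; the attribute is then stored in a fixed slot instead of a per-instance __dict__:

class A:
    __slots__ = ('a',)  # no per-instance __dict__; 'a' lives in a fixed slot

    def __init__(self, a):
        self.a = a

    def __hash__(self):
        return hash(self.a)

    def __eq__(self, other):
        if not isinstance(other, A):
            return NotImplemented
        return self.a == other.a and self.__class__ == other.__class__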

There is, however, another question: is this really a problem? I cannot really believe that an application becomes slow because of a slow __hash__. __hash__ should still be pretty fast unless the dictionary has trillions of entries, but at that point everything else would become slow too and would call for bigger changes anyway.


I did some testing and have to make a correction. Using __slots__ is not going to help in this case at all. My tests actually showed that in CPython 3.7 the above class becomes slightly slower when using __slots__.
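
For reference, a sketch of how such a comparison can be reproduced with the standard timeit module instead of IPython's %timeit (absolute numbers will vary by machine and CPython version):

import timeit

mya = A(42)
d1 = {mya: [1, 2], 'foo': [3, 4]}

# one million lookups each; compare the str key against the custom __hash__ key
print(timeit.timeit("d1['foo']", globals=globals()))
print(timeit.timeit("d1[mya]", globals=globals()))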

zvone answered Oct 18 '25


