Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python __hash__: identity vs. equivalence

I have a class whose instances are to be distinguished by an identity that is different from the data values they carry. In my code, I intend to use == to mean two instances are equivalent regarding their data, and is to mean two variables refer to the same instance, that is, that they are identical. This is all quite normal, to my understanding.

Furthermore I want to use instances (equivalent or not) in sets, and as keys in dicts. That requires the __hash__ function to be defined for the class.

But in this regard I don't understand the documented requirement of __hash__:

The only required property is that objects which compare equal have the same hash value.

Does that mean that two distinct but equivalent objects cannot be used as different keys in a dict, or appear individually in a set? In the code below I break that requirement by overriding __eq__ and __hash__ to reflect my intended use. It works as expected in Python 2.7 and 3.7.

What are the negative consequences of breaking the requirement of __hash__ as I've done here?

Is there a better way to accomplish my goal?

class A( object ):
        def __init__( self, v1, v2 ):
                self.v = ( v1, v2 )

        def __eq__( self, b ):
                return self.v[0] == b.v[0] and self.v[1] == b.v[1]

        def __hash__( self ):
                return id( self )

        def __str__( self ):
                return str( self.v )

p = A( 1, 0 )
q = A( 1, 0 )

print( str( p ), str( q ) )
print( "identical?", p is q )
print( "equivalent?", p == q )
print( "hashes", hash(p), hash(q) )

s = set( [p, q] )
print( "set length", len( s ) )
print( "both in set?", p in s, q in s )

d = { p:3, q:4 }
print( "dict length", len( d ) )
print( "as distinct keys", d[p], d[q] )
like image 780
Steve White Avatar asked Nov 25 '25 06:11

Steve White


1 Answers

The only required property is that objects which compare equal have the same hash value.

The "compare equal" in the spec text means the result of their __eq__ methods - there is no requirement that they are the same object.

The __hash__, thought, have to be based in the values that are used in __eq__, not in the object's "id" - that part is incorrect in your code. For it to work, this is how it would have to be:

Just do:

      ...
      def __eq__( self, b ):
           return self.v[0] == b.v[0] and self.v[1] == b.v[1]

      def __hash__( self ):
           return hash((self.v[0], self.v[1]))

Does that mean that two distinct but equivalent objects cannot be used as different keys in a dict, or appear individually in a set?

Yes. This is what the spec means.

The workaround for that is to leave the default __eq__ implementation for your class to conform to the rules, and implement an alternate form of comparison that you will have to use in your code.

The most straightforward way is just to leave the default implementation of __eq__ as it is, which compares by identity, and have an arbitrary method that you use for comparison,( the idiom that code in languages that do not support operator overriding have to use anyway):

class A( object ):
    ...
    def equals( self, b ):
       return self.v[0] == b.v[0] and self.v[1] == b.v[1]

p = A( 1, 0 )
q = A( 1, 0 )

print( str( p ), str( q ) )
print( "identical?", p is q )
print( "equivalent?", p.equals(q) )

There are ways to improve a little on this - but the baseline is: __eq__ have to conform to the rules, and do an identity comparison.

One way is to have an internal associated object that works as a "namespace" which you can use for comparison:

class CompareSpace:
    def __init__(self, parent):
        self.parent = parent

        def __eq__( self, other ):
            other = other.parent if isinstance(other, type(self)) else other 
            return self.parent.v[0] == other.v[0] and other.v[1] == b.parent.v[1]


    class A:
        def __init__( self, v1, v2 ):
            self.v = ( v1, v2 )
            self.comp = CompareSpace(self)

        def __str__( self ):
            return str( self.v )

p = A( 1, 0 )
q = A( 1, 0 )

print( str( p ), str( q ) )
print( "identical?", p is q )
print( "equivalent?", p.comp == q )
print( "hashes", hash(p), hash(q) )

demonstration of brokenness

Now I will insert an example of how this breaks - I am creating a class deliberatly more broken, to ensure the problem occurs at first try. But if the problem occurs even once each 2 million times, your code will still be too broke to use in anything real, even if personal code: you will have a dictionary that is not deterministic:


class Broken:
    def __init__(self, name):
        self.name = name
    def __hash__(self):
        return id(self) % 256
    def __eq__(self, other):
        return True
    def __repr__(self):
        return self.name


In [23]: objs = [Broken(f"{i:02d}") for i in range(64)]                                        

In [24]: print(objs)                                                                           
[00, 01, 02, 03, 04, 05, 06, 07, 08, 09, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63]

In [25]: test = {}                                                                             

In [26]: for obj in objs: 
    ...:     if obj not in test: 
    ...:         test[obj] = 0 
    ...:                                                                                       

In [27]: print(test)                                                                           
{00: 0, 01: 0, 02: 0, 11: 0}

# Or with simple, unconditional, insertion:
In [29]: test = {obj: i for i, obj in enumerate(objs)}                                         

In [30]: test                                                                                  
Out[30]: {00: 57, 01: 62, 02: 63, 11: 60}

(I repeat, while your has values won't clash by themselves, internal dict code have to reduce the number in the hash to an index in to its hash table - not necessarily by module (%) - otherwise every empty dict would need 2 ** 64 empty entries, and only if all hashes were only 64bit wide)

like image 59
jsbueno Avatar answered Nov 27 '25 19:11

jsbueno



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!