What's a correct and good way to implement hash()?

People also ask

What does __ hash __ do?

Introduction to the Python hash function The hash() function accepts an object and returns the hash value as an integer. When you pass an object to the hash() function, Python will execute the __hash__ special method of the object.

What are hash values in Python?

Hash values are integers. They are used to quickly compare dictionary keys during a dictionary lookup. Python dictionaries are implemented as hash tables. So any time you use a dictionary, hash() is called on the keys that you pass in for assignment, or look-up.

An easy, correct way to implement __hash__() is to use a key tuple. It won't be as fast as a specialized hash, but if you need that then you should probably implement the type in C.

Here's an example of using a key for hash and equality:

class A:
    def __key(self):
        return (self.attr_a, self.attr_b, self.attr_c)

    def __hash__(self):
        return hash(self.__key())

    def __eq__(self, other):
        if isinstance(other, A):
            return self.__key() == other.__key()
        return NotImplemented

Also, the documentation of __hash__ has more information, that may be valuable in some particular circumstances.

John Millikin proposed a solution similar to this:

class A(object):

    def __init__(self, a, b, c):
        self._a = a
        self._b = b
        self._c = c

    def __eq__(self, othr):
        return (isinstance(othr, type(self))
                and (self._a, self._b, self._c) ==
                    (othr._a, othr._b, othr._c))

    def __hash__(self):
        return hash((self._a, self._b, self._c))

The problem with this solution is that the hash(A(a, b, c)) == hash((a, b, c)). In other words, the hash collides with that of the tuple of its key members. Maybe this does not matter very often in practice?

Update: the Python docs now recommend to use a tuple as in the example above. Note that the documentation states

The only required property is that objects which compare equal have the same hash value

Note that the opposite is not true. Objects which do not compare equal may have the same hash value. Such a hash collision will not cause one object to replace another when used as a dict key or set element as long as the objects do not also compare equal.

Outdated/bad solution

~~The Python documentation on __hash__ suggests to combine the hashes of the sub-components using something like XOR~~, which gives us this:

class B(object):

    def __init__(self, a, b, c):
        self._a = a
        self._b = b
        self._c = c

    def __eq__(self, othr):
        if isinstance(othr, type(self)):
            return ((self._a, self._b, self._c) ==
                    (othr._a, othr._b, othr._c))
        return NotImplemented

    def __hash__(self):
        return (hash(self._a) ^ hash(self._b) ^ hash(self._c) ^
                hash((self._a, self._b, self._c)))

Update: as Blckknght points out, changing the order of a, b, and c could cause problems. I added an additional ^ hash((self._a, self._b, self._c)) to capture the order of the values being hashed. This final ^ hash(...) can be removed if the values being combined cannot be rearranged (for example, if they have different types and therefore the value of _a will never be assigned to _b or _c, etc.).

Paul Larson of Microsoft Research studied a wide variety of hash functions. He told me that

for c in some_string:
    hash = 101 * hash  +  ord(c)

worked surprisingly well for a wide variety of strings. I've found that similar polynomial techniques work well for computing a hash of disparate subfields.

I can try to answer the second part of your question.

The collisions will probably result not from the hash code itself, but from mapping the hash code to an index in a collection. So for example your hash function could return random values from 1 to 10000, but if your hash table only has 32 entries you'll get collisions on insertion.

In addition, I would think that collisions would be resolved by the collection internally, and there are many methods to resolve collisions. The simplest (and worst) is, given an entry to insert at index i, add 1 to i until you find an empty spot and insert there. Retrieval then works the same way. This results in inefficient retrievals for some entries, as you could have an entry that requires traversing the entire collection to find!

Other collision resolution methods reduce the retrieval time by moving entries in the hash table when an item is inserted to spread things out. This increases the insertion time but assumes you read more than you insert. There are also methods that try and branch different colliding entries out so that entries to cluster in one particular spot.

Also, if you need to resize the collection you will need to rehash everything or use a dynamic hashing method.

In short, depending on what you're using the hash code for you may have to implement your own collision resolution method. If you're not storing them in a collection, you can probably get away with a hash function that just generates hash codes in a very large range. If so, you can make sure your container is bigger than it needs to be (the bigger the better of course) depending on your memory concerns.

Here are some links if you're interested more:

coalesced hashing on wikipedia

Wikipedia also has a summary of various collision resolution methods:

Also, "File Organization And Processing" by Tharp covers alot of collision resolution methods extensively. IMO it's a great reference for hashing algorithms.

A very good explanation on when and how implement the __hash__ function is on programiz website:

Just a screenshot to provide an overview: (Retrieved 2019-12-13)

Screenshot of https://www.programiz.com/python-programming/methods/built-in/hash 2019-12-13

As for a personal implementation of the method, the above mentioned site provides an example that matches the answer of millerdev.

class Person:
def __init__(self, age, name):
    self.age = age
    self.name = name

def __eq__(self, other):
    return self.age == other.age and self.name == other.name

def __hash__(self):
    print('The hash is:')
    return hash((self.age, self.name))

person = Person(23, 'Adam')
print(hash(person))

A good way to implement hash (as well as list, dict, tuple) is to make the object have a predictable order of items by making it iterable using __iter__. So to modify an example from above:

class A(object):

    def __init__(self, a, b, c):
        self._a = a
        self._b = b
        self._c = c

    def __iter__(self):
        yield "a", self._a
        yield "b", self._b
        yield "c", self._c

    def __hash__(self):
        return hash(tuple(self))

    def __eq__(self, other):
        return (isinstance(other, type(self))
                and tuple(self) == tuple(other))

(here __eq__ is not required for hash, but it's easy to implement).

Now add some mutable members to see how it works:

a = 2; b = 2.2; c = 'cat'
hash(A(a, b, c))  # -5279839567404192660
dict(A(a, b, c))  # {'a': 2, 'b': 2.2, 'c': 'cat'}
list(A(a, b, c))  # [('a', 2), ('b', 2.2), ('c', 'cat')]
tuple(A(a, b, c)) # (('a', 2), ('b', 2.2), ('c', 'cat'))

things only fall apart if you try to put non-hashable members in the object model:

hash(A(a, b, [1]))  # TypeError: unhashable type: 'list'

Related questions
                            
                                django import error - No module named core.management
                            
                                Difference between exit(0) and exit(1) in Python
                            
                                Find element's index in pandas Series
                            
                                Python Request Post with param data
                            
                                Correct way to pause a Python program
                            
                                How can you dynamically create variables? [duplicate]
                            
                                Identifying the dependency relationship for python packages installed with pip
                            
                                Django self-referential foreign key
                            
                                Binary search (bisection) in Python
                            
                                Accessing Object Memory Address
                            
                                Plotting a 2D heatmap with Matplotlib
                            
                                python: SyntaxError: EOL while scanning string literal
                            
                                Why can tuples contain mutable items?
                            
                                Where do I call the BatchNormalization function in Keras?
                            
                                Iterate over model instance field names and values in template
                            
                                How to create key or append an element to key?
                            
                                How to save a dictionary to a file?
                            
                                How to limit the maximum value of a numeric field in a Django model?
                            
                                What is the relationship between virtualenv and pyenv?
                            
                                How to change default Anaconda python environment

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

What's a correct and good way to implement hash()?

Tags:

python

dictionary

hashtable

hashcode

People also ask

Outdated/bad solution

Recent Activity

Donate For Us

What's a correct and good way to implement __hash__()?

Tags:

python

dictionary

hashtable

hashcode

People also ask

Outdated/bad solution

Related questions

Recent Activity

Donate For Us

What's a correct and good way to implement hash()?