
How to combine hash codes in Python 3?

I am more familiar with the "Java way" of building complex / combined hash codes from superclasses in subclasses. Is there a better / different / preferred way in Python 3? (I cannot find anything specific to Python3 on this matter via Google.)

class Superclass:
    def __init__(self, data):
        self.__data = data

    def __hash__(self):
        return hash(self.__data)

class Subclass(Superclass):
    def __init__(self, data, more_data):
        super().__init__(data)
        self.__more_data = more_data

    def __hash__(self):
        # Just a guess...
        return hash(super()) + 31 * hash(self.__more_data)

To simplify this question, please assume self.__data and self.__more_data are simple, hashable data, such as str or int.

Asked Apr 03 '15 by kevinarpe

1 Answer

The easiest way to produce good hashes is to put your values in a standard hashable Python container, then hash that. This includes combining hashes in subclasses. I'll explain why, and then how.

Base requirements

First things first:

  • If two objects test as equal, then they MUST have the same hash value
  • Objects that have a hash MUST produce the same hash over time.

Only when you follow those two rules can your objects safely be used in dictionaries and sets. The hash must not change because dictionaries and sets use it to pick a storage location; if the hash changes, they can no longer locate the object when given another object that tests equal.

Note that it doesn’t even matter if the two objects are of different types; True == 1 == 1.0 so all have the same hash and will all count as the same key in a dictionary.
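For example, because True, 1 and 1.0 all test equal, they must (and do) share a single hash value, and they collapse to one dictionary key:

>>> hash(True) == hash(1) == hash(1.0)
True
>>> {True: 'a', 1: 'b', 1.0: 'c'}
{True: 'c'}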

What makes a good hash value

You'd want to combine the components of your object value in ways that will produce, as much as possible, different hashes for different values. That includes things like ordering and specific meaning, so that two attributes that represent different aspects of your value, but that can hold the same type of Python objects, still result in different hashes, most of the time.

Note that it's fine if two objects that represent different values (won't test equal) have equal hashes. Reusing a hash value won't break sets or dictionaries. However, if a lot of different object values produce equal hashes, that reduces their efficiency, as you increase the likelihood of collisions. Collisions require collision resolution, and collision resolution takes more time, so much so that predictable hashing implementations can be exploited for denial-of-service attacks against servers (*).

So you want a nice wide spread of possible hash values.

Pitfalls to watch out for

The documentation for the object.__hash__ method includes some advice on how to combine values:

The only required property is that objects which compare equal have the same hash value; it is advised to somehow mix together (e.g. using exclusive or) the hash values for the components of the object that also play a part in comparison of objects.

but only using XOR will not produce good hash values, not when the values whose hashes you XOR together can be of the same type but have different meaning depending on the attribute they've been assigned to. To illustrate with an example:

>>> class Foo:
...     def __init__(self, a, b):
...         self.a = a
...         self.b = b
...     def __hash__(self):
...         return hash(self.a) ^ hash(self.b)
...
>>> hash(Foo(42, 'spam')) == hash(Foo('spam', 42))
True

Because the hashes for self.a and self.b were just XOR-ed together, we got the same hash value for either order, effectively halving the number of usable hashes. Do this with more attributes and you cut the number of unique hashes down rapidly. So you may want to include a bit more information in the hash about each attribute, if the same values can be used in different elements that make up the hash.

Next, know that while Python integers are unbounded, hash values are not. That is to say, hash values have a finite range. From the same documentation:

Note: hash() truncates the value returned from an object’s custom __hash__() method to the size of a Py_ssize_t. This is typically 8 bytes on 64-bit builds and 4 bytes on 32-bit builds.

This means that if you use addition, multiplication or other operations that increase the number of bits needed to store the hash value, you end up losing the upper bits and so reduce the number of different hash values again.
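You can inspect that range at runtime. On a 64-bit CPython build, for example, sys.hash_info reports a 64-bit hash width, and integer hashes wrap around the modulus 2**61 - 1:

>>> import sys
>>> sys.hash_info.width, sys.hash_info.modulus == 2**61 - 1
(64, True)
>>> hash(2**61 - 1)  # large ints are reduced modulo 2**61 - 1
0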

Next, if you combine multiple hashes with XOR that already have a limited range, chances are you end up with an even smaller number of possible hashes. Try XOR-ing the hashes of 1000 random integers in the range 0-10, for an extreme example.
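As a rough sketch of that extreme case: small non-negative integers hash to themselves, so XOR-ing any number of values drawn from the range 0-10 can never produce more than 16 distinct results:

>>> import random
>>> from functools import reduce
>>> values = [random.randint(0, 10) for _ in range(1000)]
>>> reduce(lambda acc, value: acc ^ hash(value), values, 0) < 16  # only 16 possible outcomes
True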

Hashing, the easy way

Python developers have long since wrestled with the above pitfalls and solved them for the standard library types. Use this to your advantage. Put your values in a tuple, then hash that tuple.

Python tuples use a simplified version of the xxHash algorithm to capture order information and to ensure a broad range of hash values. So for different attributes, you can capture the different meanings by giving them different positions in a tuple, then hashing the tuple:

def __hash__(self):
    return hash((self.a, self.b))

This way the position of each attribute is folded into the hash, so swapping values between attributes (almost always) produces different hashes.
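Applied to the earlier Foo example (with a matching __eq__ added so that equality and hashing stay consistent), the two argument orders no longer share a hash in practice:

>>> class Foo:
...     def __init__(self, a, b):
...         self.a = a
...         self.b = b
...     def __eq__(self, other):
...         return (self.a, self.b) == (other.a, other.b)
...     def __hash__(self):
...         return hash((self.a, self.b))
...
>>> hash(Foo(42, 'spam')) == hash(Foo('spam', 42))
False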

If you are subclassing something, put the hash of the parent implementation into one of the tuple positions:

def __hash__(self):
    return hash((super().__hash__(), self.__more_data))

Hashing a hash value does reduce it to a 60-bit or 30-bit value (on 64-bit or 32-bit platforms, respectively), but that's not a big problem when combined with other values in a tuple. If you are really concerned about this, put None in the tuple as a placeholder and XOR the parent hash (so super().__hash__() ^ hash((None, self.__more_data))). But this is overkill, really.
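Applied to the classes from the question, a minimal sketch might look like the following (the matching __eq__ methods are left out for brevity, but remember they must stay consistent with these hashes):

class Superclass:
    def __init__(self, data):
        self.__data = data

    def __hash__(self):
        return hash(self.__data)

class Subclass(Superclass):
    def __init__(self, data, more_data):
        super().__init__(data)
        self.__more_data = more_data

    def __hash__(self):
        # one tuple slot for the parent's hash, one for this class's extra state
        return hash((super().__hash__(), self.__more_data))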

If you have multiple values whose relative order doesn't matter, and you don't want to XOR these all together one by one, consider using a frozenset() object for fast processing, combined with a collections.Counter() object if values are not meant to be unique. The frozenset() hash operation accounts for small hash ranges by reshuffling the bits in hashes first:

# unordered collection hashing
from collections import Counter
hash(frozenset(Counter(...).items()))

As always, all values in the tuple or frozenset() must be hashable themselves.
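For instance, two multisets that hold the same values in a different order then compare equal and hash identically, as required:

>>> from collections import Counter
>>> a = Counter(['spam', 'ham', 'spam'])
>>> b = Counter(['ham', 'spam', 'spam'])
>>> a == b
True
>>> hash(frozenset(a.items())) == hash(frozenset(b.items()))
True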

Consider using dataclasses

For most objects you write __hash__ functions for, you actually want to use a dataclass-generated class:

from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class Foo:
    a: Union[int, str]
    b: Union[int, str]

Dataclasses are given a sane __hash__ implementation when frozen=True or unsafe_hash=True, using a tuple() of all the field values.
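With the frozen dataclass above, equality and hashing line up out of the box, so instances behave correctly as dictionary keys or set members:

>>> Foo(42, 'spam') == Foo(42, 'spam')
True
>>> hash(Foo(42, 'spam')) == hash(Foo(42, 'spam'))
True
>>> len({Foo(42, 'spam'), Foo(42, 'spam'), Foo('spam', 42)})
2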


(*) Python protects your code against such hash collision attacks by using a process-wide random hash seed to hash strings, bytes and datetime objects.

Answered Sep 28 '22 by Martijn Pieters