Short version: What's the best hashing algorithm for a multiset implemented as a dictionary of unordered items?
I'm trying to hash an immutable multiset (called a bag in some other languages: like a mathematical set, except that it can hold more than one of each element), implemented as a dictionary. I've created a subclass of the standard library class collections.Counter, similar to the advice here: Python hashable dicts, which recommends a hash function like so:
import collections

class FrozenCounter(collections.Counter):
    # ...
    def __hash__(self):
        return hash(tuple(sorted(self.items())))
Creating the full tuple of items takes up a lot of memory (relative to, say, using a generator), and hashing will occur in an extremely memory-intensive part of my application. More importantly, my dictionary keys (multiset elements) probably won't be orderable.
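For example (a sketch of the problem, not from the linked answer), in Python 3 the sorted-tuple approach breaks as soon as the keys aren't mutually comparable:

import collections

# Keys of mixed, non-comparable types -- sorting the items raises TypeError in Python 3
c = collections.Counter(['spam', 'spam', (1, 2)])
try:
    hash(tuple(sorted(c.items())))
except TypeError as e:
    print(e)  # e.g. "'<' not supported between instances of 'tuple' and 'str'"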
I'm thinking of using this algorithm:
import functools

def __hash__(self):
    return functools.reduce(lambda acc, item: acc ^ hash(item), self.items(), 0)
I figure using bitwise XOR means order doesn't matter for the hash value, unlike in the hashing of a tuple? I suppose I could semi-implement the Python tuple-hashing algorithm on the unordered stream of tuples of my data. See https://github.com/jonashaag/cpython/blob/master/Include/tupleobject.h (search in the page for the word 'hash') -- but I barely know enough C to read it.
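As a quick sanity check of the XOR idea (a sketch assuming the XOR is applied to the per-item hashes, as above), the result really is order-independent:

import functools

def xor_hash(items):
    # XOR the per-item hashes; XOR is commutative, so insertion order doesn't matter
    return functools.reduce(lambda acc, item: acc ^ hash(item), items, 0)

items_a = [('a', 1), ('b', 2), ('c', 3)]
items_b = [('c', 3), ('a', 1), ('b', 2)]  # same items, different order
print(xor_hash(items_a) == xor_hash(items_b))  # True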
Thoughts? Suggestions? Thanks.
Both @marcin and @senderle gave pretty much the same answer: use hash(frozenset(self.items())). This makes sense because items() "views" are set-like. @marcin was first, but I gave the check mark to @senderle because of the good research on the big-O running times of the different solutions. @marcin also reminds me to include an __eq__ method -- but the one inherited from dict will work just fine. This is how I'm implementing everything -- further comments and suggestions based on this code are welcome:
import collections

class FrozenCounter(collections.Counter):
    # Edit: A previous version of this code included a __slots__ definition.
    # But, from the Python documentation: "When inheriting from a class without
    # __slots__, the __dict__ attribute of that class will always be accessible,
    # so a __slots__ definition in the subclass is meaningless."
    # http://docs.python.org/py3k/reference/datamodel.html#notes-on-using-slots

    # ...

    def __hash__(self):
        "Implements hash(self) -> int"
        if not hasattr(self, '_hash'):
            self._hash = hash(frozenset(self.items()))
        return self._hash
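For what it's worth, here's a quick check (not part of my original code) that equal multisets built in different orders compare and hash the same, and can be used as dictionary keys:

fc1 = FrozenCounter('abracadabra')
fc2 = FrozenCounter('aaaaabbrrcd')  # same elements, added in a different order

print(fc1 == fc2)              # True -- equality inherited from dict
print(hash(fc1) == hash(fc2))  # True -- frozenset of items ignores order
lookup = {fc1: 'found'}        # usable as a dictionary key
print(lookup[fc2])             # 'found'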
Since the dictionary is immutable, you can create the hash when the dictionary is created and return it directly. My suggestion would be to create a frozenset from items() (in Python 3+; iteritems() in 2.7), hash it, and store the hash.
To provide an explicit example:
>>> frozenset(Counter([1, 1, 1, 2, 3, 3, 4]).iteritems())
frozenset([(3, 2), (1, 3), (4, 1), (2, 1)])
>>> hash(frozenset(Counter([1, 1, 1, 2, 3, 3, 4]).iteritems()))
-3071743570178645657
>>> hash(frozenset(Counter([1, 1, 1, 2, 3, 4]).iteritems()))
-6559486438209652990
To clarify why I prefer a frozenset to a tuple of sorted items: a frozenset doesn't have to sort the items, so the initial hash completes in O(n) time rather than O(n log n) time. This can be seen from the frozenset_hash and set_next implementations.
See also this great answer from Raymond Hettinger describing his implementation of the frozenset hash function. There he explicitly explains how the hash function avoids having to sort values to get a stable, order-insensitive value.
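To illustrate that order insensitivity (a made-up example, not from Hettinger's answer):

items_a = frozenset([('x', 1), ('y', 2), ('z', 3)])
items_b = frozenset([('z', 3), ('x', 1), ('y', 2)])  # same pairs, built in a different order

print(items_a == items_b)              # True
print(hash(items_a) == hash(items_b))  # True -- no sorting needed for a stable hash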
Have you considered hash(tuple(sorted(hash(x) for x in self.items())))? That way, you are only sorting integers (the item hashes), not the items themselves.
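Something like this minimal sketch (wrapping the sorted hashes in a tuple so the result is itself hashable):

import collections

class FrozenCounter(collections.Counter):
    def __hash__(self):
        # Sort the integer hashes of the (element, count) pairs rather than the
        # pairs themselves, so elements never need to be mutually orderable.
        return hash(tuple(sorted(hash(item) for item in self.items())))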
You could also xor the element hashes together, but frankly I don't know how well that would work (would you have a lot of collisions?). Speaking of collisions, don't you have to implement the __eq__ method?
Alternatively, similar to my answer here, hash(frozenset(self.items())).