Short version: What's the best hashing algorithm for a multiset implemented as a dictionary of unordered items?
I'm trying to hash an immutable multiset (called a bag in some other languages: like a mathematical set, except that it can hold more than one of each element), implemented as a dictionary. I've created a subclass of the standard library class collections.Counter, similar to the advice here: Python hashable dicts, which recommends a hash function like so:
import collections

class FrozenCounter(collections.Counter):
    # ...
    def __hash__(self):
        return hash(tuple(sorted(self.items())))
Creating the full tuple of items takes up a lot of memory (relative to, say, using a generator), and hashing will occur in an extremely memory-intensive part of my application. More importantly, my dictionary keys (multiset elements) probably won't be orderable.
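For example (a sketch of the problem, not from the linked answer), in Python 3 the sorted-tuple approach breaks as soon as the keys aren't mutually comparable:

import collections

# Keys of mixed, non-comparable types -- sorting the items raises TypeError in Python 3
c = collections.Counter(['spam', 'spam', (1, 2)])
try:
    hash(tuple(sorted(c.items())))
except TypeError as e:
    print(e)  # e.g. "'<' not supported between instances of 'tuple' and 'str'"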
I'm thinking of using this algorithm:
import functools

def __hash__(self):
    return functools.reduce(lambda acc, item: acc ^ hash(item), self.items(), 0)
I figure using bitwise XOR means order doesn't matter for the hash value, unlike in the hashing of a tuple? I suppose I could semi-implement the Python tuple-hashing algorithm on the unordered stream of tuples of my data. See https://github.com/jonashaag/cpython/blob/master/Include/tupleobject.h (search in the page for the word 'hash') -- but I barely know enough C to read it.
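As a quick sanity check of the XOR idea (a sketch assuming the XOR is applied to the per-item hashes, as above), the result really is order-independent:

import functools

def xor_hash(items):
    # XOR the per-item hashes; XOR is commutative, so insertion order doesn't matter
    return functools.reduce(lambda acc, item: acc ^ hash(item), items, 0)

items_a = [('a', 1), ('b', 2), ('c', 3)]
items_b = [('c', 3), ('a', 1), ('b', 2)]  # same items, different order
print(xor_hash(items_a) == xor_hash(items_b))  # True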
Thoughts? Suggestions? Thanks.
Both @marcin and @senderle gave pretty much the same answer: use hash(frozenset(self.items())). This makes sense because items() "views" are set-like. @marcin was first, but I gave the check mark to @senderle because of the good research on the big-O running times of the different solutions. @marcin also reminds me to include an __eq__ method -- but the one inherited from dict will work just fine. This is how I'm implementing everything -- further comments and suggestions based on this code are welcome:
import collections

class FrozenCounter(collections.Counter):
    # Edit: A previous version of this code included a __slots__ definition.
    # But, from the Python documentation: "When inheriting from a class without
    # __slots__, the __dict__ attribute of that class will always be accessible,
    # so a __slots__ definition in the subclass is meaningless."
    # http://docs.python.org/py3k/reference/datamodel.html#notes-on-using-slots

    # ...

    def __hash__(self):
        "Implements hash(self) -> int"
        if not hasattr(self, '_hash'):
            self._hash = hash(frozenset(self.items()))
        return self._hash
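For what it's worth, here's a quick check (not part of my original code) that equal multisets built in different orders compare and hash the same, and can be used as dictionary keys:

fc1 = FrozenCounter('abracadabra')
fc2 = FrozenCounter('aaaaabbrrcd')  # same elements, added in a different order

print(fc1 == fc2)              # True -- equality inherited from dict
print(hash(fc1) == hash(fc2))  # True -- frozenset of items ignores order
lookup = {fc1: 'found'}        # usable as a dictionary key
print(lookup[fc2])             # 'found'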
Since the dictionary is immutable, you can create the hash when the dictionary is created and return it directly. My suggestion would be to create a frozenset from items() (in Python 3+; iteritems() in 2.7), hash it, and store the hash.
To provide an explicit example:
>>> frozenset(Counter([1, 1, 1, 2, 3, 3, 4]).iteritems())
frozenset([(3, 2), (1, 3), (4, 1), (2, 1)])
>>> hash(frozenset(Counter([1, 1, 1, 2, 3, 3, 4]).iteritems()))
-3071743570178645657
>>> hash(frozenset(Counter([1, 1, 1, 2, 3, 4]).iteritems()))
-6559486438209652990
To clarify why I prefer a frozenset to a tuple of sorted items: a frozenset doesn't have to sort the items, so the initial hash completes in O(n) time rather than O(n log n) time. This can be seen from the frozenset_hash and set_next implementations.
See also this great answer from Raymond Hettinger describing his implementation of the frozenset hash function. There he explicitly explains how the hash function avoids having to sort values to get a stable, order-insensitive value.
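To illustrate that order insensitivity (a made-up example, not from Hettinger's answer):

items_a = frozenset([('x', 1), ('y', 2), ('z', 3)])
items_b = frozenset([('z', 3), ('x', 1), ('y', 2)])  # same pairs, built in a different order

print(items_a == items_b)              # True
print(hash(items_a) == hash(items_b))  # True -- no sorting needed for a stable hash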
Have you considered hash(tuple(sorted(hash(x) for x in self.items())))? That way, you are only sorting integers (the item hashes), not the items themselves.
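Something like this minimal sketch (wrapping the sorted hashes in a tuple so the result is itself hashable):

import collections

class FrozenCounter(collections.Counter):
    def __hash__(self):
        # Sort the integer hashes of the (element, count) pairs rather than the
        # pairs themselves, so elements never need to be mutually orderable.
        return hash(tuple(sorted(hash(item) for item in self.items())))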
You could also xor the element hashes together, but frankly I don't know how well that would work (would you have a lot of collisions?). Speaking of collisions, don't you have to implement the __eq__ method?
Alternatively, similar to my answer here, hash(frozenset(self.items())).