How to avoid rehashing dict keys?

Question

The script below illustrates a capability of set and frozenset that I would like to understand, and, if possible replicate in a subclass of collections.MutableSet. (BTW, this feature is not just a oddity of set and frozenset: it is actively verified in Python's unit tests for these types.)

The script performs the following steps for each of several types/classes of set-like objects:

create a dict d whose n keys are specially instrumented integers that keep track of how many times their __hash__ method is invoked (d's values are all None, but this is irrelevant);
compute (and save for later) the cumulative number of times the __hash__ method of d's keys has been called so far (i.e. during the creation of d);
create an object s of the current set-like type/class, using d as the argument to the constructor (hence, d's keys will become the contents of the resulting object, whereas d's values will be ignored);
redo the calculation described in (2);
output the results from calculations from (2) and (4) above.

Here's the output for the case where n is set to 10 for all types/classes (I give the full code at the end of this post):

set: 10 10
frozenset: 10 10
Set: 10 20
myset: 10 20

The conclusion is clear: constructing a set or a frozenset from d does not require invoking the __hash__ method of d's keys, hence the call counts remain unchanged after these constructors return. This is not the case, however, when instances of Set or myset are created from d. In each of these cases it appears that each of d's keys' __hash__ is invoked once.

How can I modify myset (see below) so that running its constructor with d as its argument results in no calls to d's keys' hash methods?

Thanks!

from sets import Set
from collections import MutableSet

class hash_counting_int(int):
    def __init__(self, *args):
        self.count = 0
    def __hash__(self):
        self.count += 1
        return int.__hash__(self)

class myset(MutableSet):
    def __init__(self, iterable=()):
        # The values of self.dictset matter!  See further notes below.
        self.dictset = dict((item, i) for i, item in enumerate(iterable))

    def __bomb(s, *a, **k): raise NotImplementedError
    add = discard = __contains__ = __iter__ = __len__ = __bomb

def test_do_not_rehash_dict_keys(thetype, n=1):
    d = dict.fromkeys(hash_counting_int(k) for k in xrange(n))
    before = sum(elem.count for elem in d)
    s = thetype(d)
    after = sum(elem.count for elem in d)
    return before, after

for t in set, frozenset, Set, myset:
    before, after = test_do_not_rehash_dict_keys(t, 10)
    print '%s: %d %d' % (t.__name__, before, after)

Note that, the values of self.dictset are integers, and are pointedly not the same as the (ignored) iterable.values() (in those cases where iterable.values actually exists)! This is an attempt (admittedly feeble) to indicate that, even when iterable is a dict (which need not be the case) and its values are ignored, in the real code that this example is standing in for, the values of self.dictset are always significant. This means that any solution based on using self.dictset.update(iterable) still has to solve the problem of assigning the proper values to its keys, and once again one faces the problem of iterating over these keys without invoking their __hash__ methods. (In addition, solutions based on self.dictset.update(iterable) also have to solve the problem of properly handling the case when iterable is not a suitable argument for self.dictset.update, although this problem is not insurmountable.)

Edits: 1) clarified the significance of myset.dictset's values; 2) renamed myset.__bomb__ to myset.__bomb.

senderle · Accepted Answer

At the most basic level, it's rehashing the keys because you pass a genex to dict instead of a mapping.

You could try this:

class myset(MutableSet):
    def __init__(self, iterable=()):
        self.dictset = {}
        self.dictset.update(iterable)

    def __bomb__(s, *a, **k): raise NotImplementedError
    add = discard = __contains__ = __iter__ = __len__ = __bomb__

Output:

set: 10 10
frozenset: 10 10
Set: 10 20
myset: 10 10

update accepts a genex too, but if iterable is a mapping, Python is smart enough not to rehash the keys. Indeed, you don't even have to create the dict separately as above. You can just do dict(mapping) as long as you don't encapsulate it within a genex. But you've indicated that you also want to change the value associated with the key. This is possible in a sense, with dict.fromkeys(mapping, default_val): you can specify a default value in that case, and all keys will take that value, but because you pass a mapping, nothing will be rehashed. But that's still not enough, I'm guessing; you seem to want to give a new and unique value to each key.

So your real question is, very simply, is it possible to assign a new value to a key without rehashing the key. And when phrased that way, perhaps you can see that it's not possible in a straightforward way.

In general, there's no built-in way to change the value of an arbitrary key:value pair without rehashing the key. This is for two reasons:

When assigning a value to an arbitrary key, Python needs to know both the key and its hash, in case there's a collision. Python could allow you to pass both a key and a pre-calculated hash, but then you could really screw things up by passing an inconsistent hash. So it's better for all of us to let Python do the bookkeeping there. The overhead of calling __hash__ is worth it. (Note that in at least some cases, Python caches the hash — in those cases, this just looks up the cached hash.)
The other way to change a value would be to change the pointer value stored at some particular memory address that the dict points to, which you have saved and associated with the key. This, very simply, involves exposing way too much of Python's internal structure. However, this approach is the basis of a hackish solution detailed below.

Now, Python itself can efficiently merge two dictionaries by manipulating dict internals, because 1 isn't valid in that case; it is guaranteed that all collisions have been dealt with already! But again, those internals shouldn't be exposed. With fromkeys, Python probably does something akin to 2 internally, but the default value is always the same. I can imagine a situation where Python would offer another keyword extension to fromkeys that would accept a function instead of a default value; it would call the function with the associated key and use the returned value. That would be cool. But it doesn't exist.

So our only hope is to do something hackish. Since we very simply can't change the value associated with a dict key without rehashing, we'll simply have to associate the key with a mutable value.

>>> a = dict((hash_counting_int(x), []) for x in range(10))
>>> [x.count for x in a.keys()]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
>>> b = dict(a)
>>> [x.count for x in a.keys()]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
>>> for n, v in enumerate(b.itervalues()):
...     v.append(n)
... 
>>> [x.count for x in a.keys()]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
>>> b
{0: [0], 1: [1], 2: [2], 3: [3], 4: [4], 5: [5], 6: [6], 7: [7], 8: [8], 9: [9]}

This is, unfortunately, the only possible solution that doesn't involve mucking about in dict's internals. And I hope you agree that it's not a very good one.

How to avoid rehashing dict keys?

Tags:

python

set

kjo

1 Answers

senderle

Recent Activity

Donate For Us

How to avoid rehashing dict keys?

Tags:

python

set

kjo

1 Answers

senderle

Related questions

Recent Activity

Donate For Us