The script below illustrates a capability of set and frozenset that I would like to understand and, if possible, replicate in a subclass of collections.MutableSet. (BTW, this feature is not just an oddity of set and frozenset: it is actively verified in Python's unit tests for these types.)
The script performs the following steps for each of several types/classes of set-like objects:

1. create a dict d whose n keys are specially instrumented integers that keep track of how many times their __hash__ method is invoked (d's values are all None, but this is irrelevant);
2. record the number of times the __hash__ method of d's keys has been called so far (i.e. during the creation of d);
3. create an instance s of the current set-like type/class, using d as the argument to the constructor (hence, d's keys will become the contents of the resulting object, whereas d's values will be ignored);
4. record the number of __hash__ calls again, and print both counts.

Here's the output for the case where n is set to 10 for all types/classes (I give the full code at the end of this post):
    set: 10 10
    frozenset: 10 10
    Set: 10 20
    myset: 10 20
The conclusion is clear: constructing a set or a frozenset from d does not require invoking the __hash__ method of d's keys, hence the call counts remain unchanged after these constructors return. This is not the case, however, when instances of Set or myset are created from d: in each of those cases, it appears that each of d's keys' __hash__ methods is invoked once.
How can I modify myset (see below) so that running its constructor with d as its argument results in no calls to d's keys' __hash__ methods?

Thanks!
    from sets import Set
    from collections import MutableSet

    class hash_counting_int(int):
        """An int that counts how many times its __hash__ method is called."""
        def __init__(self, *args):
            self.count = 0
        def __hash__(self):
            self.count += 1
            return int.__hash__(self)

    class myset(MutableSet):
        def __init__(self, iterable=()):
            # The values of self.dictset matter!  See further notes below.
            self.dictset = dict((item, i) for i, item in enumerate(iterable))
        def __bomb(s, *a, **k): raise NotImplementedError
        add = discard = __contains__ = __iter__ = __len__ = __bomb

    def test_do_not_rehash_dict_keys(thetype, n=1):
        d = dict.fromkeys(hash_counting_int(k) for k in xrange(n))
        before = sum(elem.count for elem in d)
        s = thetype(d)
        after = sum(elem.count for elem in d)
        return before, after

    for t in set, frozenset, Set, myset:
        before, after = test_do_not_rehash_dict_keys(t, 10)
        print '%s: %d %d' % (t.__name__, before, after)
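
As a quick sanity check (this snippet is my own illustration, not part of the original script), the instrumented integers record exactly the number of __hash__ calls made on them:

    k = hash_counting_int(7)
    print k.count    # 0: no __hash__ calls yet
    hash(k)
    hash(k)
    print k.count    # 2: each hash() call was recorded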
Note that the values of self.dictset are integers, and are pointedly not the same as the (ignored) iterable.values() (in those cases where iterable.values actually exists)! This is an attempt (admittedly feeble) to indicate that, even when iterable is a dict (which need not be the case) and its values are ignored, in the real code that this example stands in for, the values of self.dictset are always significant. This means that any solution based on self.dictset.update(iterable) still has to solve the problem of assigning the proper values to its keys, and once again one faces the problem of iterating over these keys without invoking their __hash__ methods. (In addition, solutions based on self.dictset.update(iterable) also have to handle the case where iterable is not a suitable argument for self.dictset.update, although this problem is not insurmountable.)
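
For what it's worth, the non-mapping case could be handled with an explicit type check. Here is a minimal sketch (the class name, the isinstance test, and the None placeholder values are my assumptions, not part of the original code; and it does not solve the values problem just described):

    from collections import MutableSet

    class myset_sketch(MutableSet):
        def __init__(self, iterable=()):
            self.dictset = {}
            if isinstance(iterable, dict):
                # On CPython, dict.update with a real dict reuses the
                # stored hashes, so the keys' __hash__ is never called
                # (but the dict's values are copied along with its keys).
                self.dictset.update(iterable)
            else:
                # A plain iterable: every item must be hashed on insertion.
                self.dictset.update((item, None) for item in iterable)
        def __bomb(s, *a, **k): raise NotImplementedError
        add = discard = __contains__ = __iter__ = __len__ = __bomb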
Edits: 1) clarified the significance of myset.dictset's values; 2) renamed myset.__bomb__ to myset.__bomb.
At the most basic level, it's rehashing the keys because you pass a generator expression (a "genex") to dict instead of a mapping.
You could try this:
    class myset(MutableSet):
        def __init__(self, iterable=()):
            self.dictset = {}
            self.dictset.update(iterable)
        def __bomb__(s, *a, **k): raise NotImplementedError
        add = discard = __contains__ = __iter__ = __len__ = __bomb__
Output:

    set: 10 10
    frozenset: 10 10
    Set: 10 20
    myset: 10 10
update accepts a genex too, but if iterable is a mapping, Python is smart enough not to rehash the keys. Indeed, you don't even have to create the dict separately as above: you can just do dict(mapping), as long as you don't encapsulate it within a genex. But you've indicated that you also want to change the value associated with each key. This is possible, in a sense, with dict.fromkeys(mapping, default_val): you can specify a default value, and all keys will take that value, but because you pass a mapping, nothing is rehashed. That's still not enough, I'm guessing; you seem to want to give a new and unique value to each key.
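
To make the distinction concrete, here is a quick check (my own illustration, reusing hash_counting_int from the question; the no-rehash shortcuts are CPython implementation details):

    d = dict.fromkeys(hash_counting_int(k) for k in xrange(5))
    print sum(k.count for k in d)    # 5: each key hashed once on insertion

    dict(d)                          # copying a mapping: no rehash
    dict.fromkeys(d, 42)             # shared default value: no rehash
    print sum(k.count for k in d)    # still 5

    dict((k, i) for i, k in enumerate(d))    # genex of pairs: rehashes
    print sum(k.count for k in d)    # 10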
So your real question is, very simply: is it possible to assign a new value to a key without rehashing the key? When phrased that way, perhaps you can see that it's not possible in any straightforward way. In general, there's no built-in way to change the value of an arbitrary key:value pair without rehashing the key, for two reasons:

1. When assigning a value to an arbitrary key, Python needs to know both the key and its hash, in case there's a collision. Python could allow you to pass both a key and a pre-calculated hash, but then you could really screw things up by passing an inconsistent hash. So it's better for all of us to let Python do the bookkeeping there; the overhead of calling __hash__ is worth it. (Note that in at least some cases Python caches the hash; there, this just looks up the cached value. A small demonstration follows this list.)
2. The other way to change a value would be to change the pointer stored at some particular memory address that the dict points to, which you have saved and associated with the key. This, very simply, involves exposing far too much of Python's internal structure. However, this approach is the basis of the hackish solution detailed below.
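
Here is the promised demonstration of reason 1 (my own illustration, again using the question's hash_counting_int): an ordinary assignment has to locate the key's slot, so it calls __hash__.

    d = dict.fromkeys(hash_counting_int(k) for k in xrange(3))
    key = list(d)[0]
    print key.count     # 1: hashed once when d was built
    d[key] = 'new'      # plain assignment must hash the key again
    print key.count     # 2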
Now, Python itself can efficiently merge two dictionaries by manipulating dict internals, because reason 1 doesn't apply in that case: it is guaranteed that all collisions have been dealt with already! But again, those internals shouldn't be exposed. With fromkeys, Python probably does something akin to 2 internally, but the default value is always the same. I can imagine Python one day offering a keyword extension to fromkeys that accepted a function instead of a default value; it would call the function with each key and use the returned value. That would be cool, but it doesn't exist.
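
If it did exist, its use might look something like this (purely hypothetical; no version of Python accepts a callable here, and value_for_key is a made-up name):

    # HYPOTHETICAL -- dict.fromkeys has no such keyword:
    # d2 = dict.fromkeys(d, factory=lambda key: value_for_key(key))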
So our only hope is to do something hackish. Since we can't change the value associated with a dict key without rehashing, we'll have to associate each key with a mutable value instead.
    >>> a = dict((hash_counting_int(x), []) for x in range(10))
    >>> [x.count for x in a.keys()]
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    >>> b = dict(a)
    >>> [x.count for x in a.keys()]
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    >>> for n, v in enumerate(b.itervalues()):
    ...     v.append(n)
    ...
    >>> [x.count for x in a.keys()]
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    >>> b
    {0: [0], 1: [1], 2: [2], 3: [3], 4: [4], 5: [5], 6: [6], 7: [7], 8: [8], 9: [9]}
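
One caveat worth spelling out (my addition): dict(a) is a shallow copy, so a and b share the very same list objects, and mutating the cells through b changes what a's keys map to as well:

    >>> a
    {0: [0], 1: [1], 2: [2], 3: [3], 4: [4], 5: [5], 6: [6], 7: [7], 8: [8], 9: [9]}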
This is, unfortunately, the only possible solution that doesn't involve mucking about in dict's internals. And I hope you agree that it's not a very good one.