Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Overriding Python's Hashing Function in Dictionary

I am trying to create a custom hash function for some object that I'll be hashing into a dictionary. The hashing function is unique (not the standard Python one). This is very important to me: to use the unique function. Each key's value is a list.

Assuming I override __hash__ and end up coming up with the right hash number for an object. Would:

dict = {}
dict[number_here] = value

hash the value into the position number number_here, or would it still be at the position that Python's hash table would compute for that number?

Printing dict only shows the items and not which position they're that. However, when I do hash(4), the result is 4. So I'm assuming this means integers are hashed to their respective locations?

Could someone please verify my findings or explain to me if I'm wrong?

like image 952
darksky Avatar asked Nov 22 '12 14:11

darksky


1 Answers

The python dict implementation uses the hash value to both sparsely store values based on the key and to avoid collisions in that storage. It uses the result of hash() as a starting point, it is not the definitive position.

Thus, although hash(4) returns 4, the exact 'position' in the underlying C structure is also based on what other keys are already there, and how large the current table is. The python hash table is resized as needed (items added), for example.

Since a dict has no ordering, this is not something you need to worry about, or can hope to influence. If you need ordering in a dict, use the collections.OrderedDict() implementation instead, which keeps track of ordering separately.

The details of the python hash table implementation

You may want to read up on how hash tables work on Wikipedia; Python uses open addressing for it's implementation.

When selecting a slot in the table, the modulus of the hash value (an integer) and the current table size is taken, thus on a table of size 32, so the key 45, hash value 45 would initially be stored in slot 14.

If there is a collision (there already is something else stored in slot 14 and it's not the integer 45), then the slot value is perturbed until an empty slot is found or the same key is found instead. Perturbing is done with the formula:

perturb = slot = hash
while slot_is_full and item_in_slot_is_not_equal_to_key:
    slot = (5*slot) + 1 + perturb
    perturb >>= 5

So, when there is a collision, another slot is picked at progressively smaller steps until it scans the whole table. Note that the table will already have been resized to make space if necessary.

In order for this to work correctly, custom types need both a __hash__() method and need to implement __eq__() to determine if two instances represent the same key. Matching hash values is not enough. For the dict implementation to consider two instances to represent the exact same key, both their hash values must match, and they must return True for the == equality operator. Such objects are considered hashable.

(For Python 2.x, implementing the __cmp__() hook would do instead of implementing __eq__(); support for this has been removed in Python 3).

like image 200
Martijn Pieters Avatar answered Nov 15 '22 15:11

Martijn Pieters