Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Dictionaries and Hashtables space complexity [duplicate]

I have some confusion over dictionaries and hash tables that I wanted to clarify .Suppose I have the current dictionary and the current output of the hashes of the current python run.

Dict = dict()
print(hash('a'))
print(hash('b'))
print(hash('c'))
Dict['a'] = 1
Dict['b'] = 2
Dict['c'] = 3
print(Dict)

has the output of

1714333803
1519074822
1245896149
{'a': 1, 'c': 3, 'b': 2}

So to my knowledge a hashtable is simply an array where the hash is the index of the hashtable. For instance, 'a' had a hash of 1714333803, therefore my hashtable index 1714333803 has a value of 'a'. So im confused how many indexes a hashtable has and how the hash function produces the answer? Does it use modulus and have a fixed range of indexes? Because the given print of the dictionary outputs {'a': 1, 'c': 3, 'b': 2}, but is it correct to assume that eventhough it outputs that, the dictionary is actually an array of atleast 1714333803 indexes, because that seems ridiculously overkill to contain 3 elements and not to mention how much of a waste of space it is. Also for the hashtable, what is in the indexes that has no value, null?

like image 584
memelord23 Avatar asked Jul 01 '16 07:07

memelord23


1 Answers

The actual size of the dict depends on the implementation, but in your case, it is probably 8. So, how does this work?

The working principle of a dict (or hash map in general) is to calculate a numerical hash for every key. In your case, that's hash("a") == 1714333803, for example. Now, the hash isn't used directly as an index. Instead, it is mapped to the size of the dictionary.

A simple method to do this is modulo (%). Let's say your dict is 8 in size. Then hash("a") % 8 == 1714333803 % 8 == 3. So your item is actually at the 4th position. By construction of the lookup algorithm, no item can ever have an index outside the array.

There are some more complex things here, like hash collisions. For example, if another item has hash 98499, that also maps to 3. There are collision resolution strategies which pick a different index in this case. They mostly try to uniformly walk the array in large strides.

So, why is your dict of size 8? Because that's the default size in python. Once your dict get's too small, it must be resized. In contrast to arrays, this is done before the dict is actually full - namely, at two thirds filling. This is done to reduce hash collisions - if your dict is 99% full, a collision is practically guaranteed. For a size 8 dict, you'd have to enter 5-6 items before it resizes, namely doubles its capacity to 16.


Note that CPython 3.6+ and PyPy (for a long time) use a two-stage data structure for dict. The first stage is a hashtable, but the second stage is not. This splits up the key mapping (stage one) and data storage (stage two). The sparse first stage provides the index for the tightly packed second stage:

# based on Raymond Hettingers mail on python-dev
# the key mapping, using a hashtable
# indices[hash(key) % length] => data index
indices =  [None, None, None, 0, None, 2, 1, None]

# the data storage, packed in insertion order
# entries[index] => hash(key), key, value
entries =  [[1714333803, 'a', 1],
            [1519074822, 'b', 2],
            [1245896149, 'c', 3]]

This scheme is algorithmically more complex for lookups (due to the indirection) but less so for iteration (directly on the data storage) and more memory efficient. Only the index table is sparse and needs to be oversized. The data storage is exactly as large as needed, unless items are deleted.

like image 172
MisterMiyagi Avatar answered Oct 02 '22 10:10

MisterMiyagi