Python's underlying hash data structure for dictionaries

Question

I am build a very large dictionary and I am performing many checks to see if a key is in the structure and then adding if it unique or incrementing a counter if it is identical.

Python uses a hash data structure to store dictionaries (not to be confused with a cryptographic hash function). Lookups are O(1), but if the hash table is full the it has to be rehashed which is very expensive.

My Question is, would I be better off using a AVL Binary Search Tree or is a hash table good enough?

Gareth Rees · Accepted Answer

The only way to be sure would be to implement both and check, but my informed guess is that the dictionary will be faster, because a binary search tree has cost O(log(n)) for lookup and insertion, and I think that except under the most pessimal of situations (such as massive hash collisions) the hash table's O(1) lookup will outweigh the occasional resize.

If you take a look at the Python dictionary implementation, you'll see that:

a dictionary starts out with 8 entries (PyDict_MINSIZE);
a dictionary with 50,000 or fewer entries quadruples in size when it grows;
a dictionary with more than 50,000 entries doubles in size when it grows;
key hashes are cached in the dictionary, so they are not recomputed when the dictionary is resized.

(The "NOTES ON OPTIMIZING DICTIONARIES" are worth reading too.)

So if your dictionary has 1,000,000 entries, I believe that it will be resized eleven times (8 → 32 → 128 → 512 → 2048 → 8192 → 32768 → 131072 → 262144 → 524288 → 1048576 → 2097152) at a cost of 2,009,768 extra insertions during the resizes. This seems likely to be much less than the cost of all the rebalancing involved in 1,000,000 insertions into an AVL tree.

pixelbeat · Answer

What's the ratio of items vs unique items? What's the expected number of unique items?

If a hash bucket fills, then extending should just be a matter of some memory reallocation, not rehashing.

Testing a counting dict should be very quick and easy to do.

Note also the counter class available since python 2.7 http://docs.python.org/library/collections.html#counter-objects http://svn.python.org/view?view=rev&revision=68559

Python's underlying hash data structure for dictionaries

Tags:

performance

python

algorithm

data-structures

rook

2 Answers

Gareth Rees

pixelbeat

Recent Activity

Donate For Us

Python's underlying hash data structure for dictionaries

Tags:

performance

python

algorithm

data-structures

rook

2 Answers

Gareth Rees

pixelbeat

Related questions

Recent Activity

Donate For Us