Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python's underlying hash data structure for dictionaries

I am build a very large dictionary and I am performing many checks to see if a key is in the structure and then adding if it unique or incrementing a counter if it is identical.

Python uses a hash data structure to store dictionaries (not to be confused with a cryptographic hash function). Lookups are O(1), but if the hash table is full the it has to be rehashed which is very expensive.

My Question is, would I be better off using a AVL Binary Search Tree or is a hash table good enough?

like image 928
rook Avatar asked Nov 25 '10 17:11

rook


2 Answers

The only way to be sure would be to implement both and check, but my informed guess is that the dictionary will be faster, because a binary search tree has cost O(log(n)) for lookup and insertion, and I think that except under the most pessimal of situations (such as massive hash collisions) the hash table's O(1) lookup will outweigh the occasional resize.

If you take a look at the Python dictionary implementation, you'll see that:

  1. a dictionary starts out with 8 entries (PyDict_MINSIZE);
  2. a dictionary with 50,000 or fewer entries quadruples in size when it grows;
  3. a dictionary with more than 50,000 entries doubles in size when it grows;
  4. key hashes are cached in the dictionary, so they are not recomputed when the dictionary is resized.

(The "NOTES ON OPTIMIZING DICTIONARIES" are worth reading too.)

So if your dictionary has 1,000,000 entries, I believe that it will be resized eleven times (8 → 32 → 128 → 512 → 2048 → 8192 → 32768 → 131072 → 262144 → 524288 → 1048576 → 2097152) at a cost of 2,009,768 extra insertions during the resizes. This seems likely to be much less than the cost of all the rebalancing involved in 1,000,000 insertions into an AVL tree.

like image 151
Gareth Rees Avatar answered Sep 24 '22 01:09

Gareth Rees


What's the ratio of items vs unique items? What's the expected number of unique items?

If a hash bucket fills, then extending should just be a matter of some memory reallocation, not rehashing.

Testing a counting dict should be very quick and easy to do.

Note also the counter class available since python 2.7 http://docs.python.org/library/collections.html#counter-objects http://svn.python.org/view?view=rev&revision=68559

like image 27
pixelbeat Avatar answered Sep 27 '22 01:09

pixelbeat