I am building a very large dictionary, and I am performing many checks to see if a key is already in the structure, then adding the key if it is unique or incrementing a counter if it is identical.
Python uses a hash table to store dictionaries (not to be confused with a cryptographic hash function). Lookups are O(1), but when the hash table fills up it has to be resized and its entries re-inserted, which is very expensive.
My question is: would I be better off using an AVL binary search tree, or is a hash table good enough?
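The pattern I'm describing is roughly this (a simplified sketch; the sample keys are placeholders, not my real data):

    # Simplified sketch of the add-or-increment pattern described above.
    stream_of_keys = ["a", "b", "a", "c", "a", "b"]   # placeholder input

    counts = {}
    for key in stream_of_keys:
        if key in counts:
            counts[key] += 1   # seen before: increment its counter
        else:
            counts[key] = 1    # unique so far: add it

    print(counts)              # {'a': 3, 'b': 2, 'c': 1}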
The only way to be sure would be to implement both and measure, but my informed guess is that the dictionary will be faster: a binary search tree costs O(log n) for lookup and insertion, and except in the most pessimal situations (such as massive hash collisions) the hash table's O(1) lookup will outweigh the occasional resize.
If you take a look at the Python dictionary implementation, you'll see that:

- a dictionary starts out with 8 entries (PyDict_MINSIZE);
- a dictionary with 50,000 or fewer entries quadruples in size when it grows;
- a dictionary with more than 50,000 entries doubles in size when it grows;
- key hashes are cached in the dictionary, so they are not recomputed when the dictionary is resized.

(The "NOTES ON OPTIMIZING DICTIONARIES" are worth reading too.)
So if your dictionary has 1,000,000 entries, I believe that it will be resized eleven times (8 → 32 → 128 → 512 → 2048 → 8192 → 32768 → 131072 → 262144 → 524288 → 1048576 → 2097152) at a cost of 2,009,768 extra insertions during the resizes. This seems likely to be much less than the cost of all the rebalancing involved in 1,000,000 insertions into an AVL tree.
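If you want to check that arithmetic, a couple of lines will do it (this assumes, as the estimate above does, that each resize costs roughly one old table's worth of re-insertions):

    # Check the resize count and the estimated extra insertions above,
    # assuming each resize re-inserts about one old table's worth of entries.
    sizes = [8, 32, 128, 512, 2048, 8192, 32768, 131072,
             262144, 524288, 1048576, 2097152]
    print(len(sizes) - 1)   # 11 resizes
    print(sum(sizes[:-1]))  # 2009768 extra insertions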
What's the ratio of total items to unique items? What's the expected number of unique items?
If the hash table fills up, extending it is mostly a matter of memory reallocation and re-inserting entries; the hash values themselves are cached, so they don't need to be recomputed.
Testing a counting dict should be very quick and easy to do.
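For example, a rough test along these lines (the key distribution here is just an assumption to make it runnable):

    # Rough timing sketch for counting a million keys with a plain dict.
    # The random-key distribution is only an assumption for illustration.
    import random
    import timeit

    keys = [random.randrange(100000) for _ in range(1000000)]

    def count_with_dict(keys):
        counts = {}
        for k in keys:
            counts[k] = counts.get(k, 0) + 1   # add if new, else increment
        return counts

    print(timeit.timeit(lambda: count_with_dict(keys), number=1))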
Note also the Counter class, available since Python 2.7: http://docs.python.org/library/collections.html#counter-objects http://svn.python.org/view?view=rev&revision=68559
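For instance, a minimal example of Counter doing the same add-or-increment work:

    # collections.Counter handles the add-or-increment bookkeeping itself.
    from collections import Counter

    counts = Counter(["spam", "egg", "spam", "ham", "spam"])
    print(counts)                  # Counter({'spam': 3, 'egg': 1, 'ham': 1})
    print(counts.most_common(1))   # [('spam', 3)]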