From Cracking the Coding Interview, page 71:
Alternatively, we can implement hash table with a BST. We can then guarantee an O(log n) lookup time, since we can keep the tree balanced. Additionally we may use less space, since a large array no longer needs to be allocated in the very beginning.
I know the basics of linked lists, hash tables and BSTs, but I am unable to understand these lines. What does it actually mean? Would this final data structure would be a Trie?
BST are memory efficient but Hash table is not.
Hashing is implemented in two steps: An element is converted into an integer by using a hash function. This element can be used as an index to store the original element, which falls into the hash table. The element is stored in the hash table where it can be quickly retrieved using hashed key.
To insert an element in BST, we have to start searching from the root node; if the node to be inserted is less than the root node, then search for an empty location in the left subtree. Else, search for the empty location in the right subtree and insert the data.
Hashing. For implementing associative arrays, hash tables, a data structure that maps keys to records using a hash function, are generally faster than binary search on a sorted array of records. Most hash table implementations require only amortized constant time on average.
The full text of that section states, with the last paragraph being the one you asked about:
A hash table is a data structure that maps keys to values for highly efficient lookup. In a very simple implementation of a hash table, the hash table has an underlying array and a hash function. When you want to insert an object and its key, the hash function maps the key to an integer, which indicates the index in the array. The object is then stored at that index.
Typically, though, this won't quite work right. In the above implementation, the hash value of all possible keys must be unique, or we might accidentally overwrite data. The array would have to be extremely large—the size of all possible keys—to prevent such "collisions."
Instead of making an extremely large array and storing objects at index hash (key), we can make the array much smaller and store objects in a linked list at index hash (key) % array_length.To get the object with a particular key, we must search the linked list for this key.
Alternatively, we can implement the hash table with a binary search tree. We can then guarantee an 0(log n) lookup time, since we can keep the tree balanced. Additionally, we may use less space, since a large array no longer needs to be allocated in the very beginning.
So they're talking about using a BST (binary search tree) to handle collisions. It wouldn't actually make sense to use a BST as the sole data structure since the whole point of a properly tuned hash is that look-up is on the order of O(1)
, much better than the O(log n)
from a BST. On top of that, using a BST to totally implement a hash table means it's not actually a hash table :-)
However, consider that, when you have collisions in a hash table, a frequent way to handle them is to have each bucket contain a linked list of its items. In the degenerate case (all items hashing to the same bucket), you end up with just a linked list and the O(1)
turns into O(n)
.
So, rather than a linked list at each bucket, you have a BST. Then you no longer have O(n)
search complexity in cases where a single bucket has many items (the previously mentioned collisions).
You use the hash function to find the bucket in O(1)
then search through the BST in O(log n)
if there are collisions. In the best case (one item per bucket), it's still O(1)
. The worst case then becomes O(log n)
rather than O(n)
.
The only thing that originally concerned me about that theory is that they also discuss the fact that a large allocation is no longer necessary. If it's a shared hash/BST combination, you still need to allocate the entire hash table so that seemed incongruous.
However, from the context ("... since a large array no longer needs to be allocated ..."), it appears that they mean they can make the hash table part of the dual data structure smaller as the collisions are more efficient to process. In other words, rather than a 1000-element hash table with linked lists for collisions, you can get away with a 100-element hash table because the collisions are not so damaging to the search time if you use a BST.
You're conflating a few terms here.
The idea would be to implement the hash table with both the array and a BST in a two-tiered fashion. One would still add values into the hash if there were no collision, but if there was, then one could solve the performance of retrieving a collided element with the BST.
A trie is something entirely different; depending on what you were attempting to store, you might not be able to apply it to a hashing function.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With