I have a simple requirement (perhaps hypothetical):
I want to store english word dictionary (n words) and given a word (character length m), the dictionary is able to tell, if the word exists in dictionary or not. What would be an appropriate data structure for this?
a balanced binary search tree? as done in C++ STL associative data structures like set,map
or
a trie on strings
Some complexity analysis: in a balanced bst, time would be (log n)*m (comparing 2 string takes O(m) time character by character)
in trie, if at each node, we could branch out in O(1) time, we can find using O(m), but the assumption that at each node, we can branch in O(1) time is not valid. at each node, max possible branches would be 26. if we want O(1) at a node, we will keep a short array indexible on characters at each node. This will blow-up the space. After a few levels in the trie, branching will reduce, so its better to keep a linked list of next node characters and pointers.
what looks more practical? any other trade-offs?
Thanks,
Unlike a binary search tree, nodes in the trie do not store their associated key. Instead, a node's position in the trie defines the key with which it is associated. This distributes the value of each key across the data structure, and means that not every node necessarily has an associated value.
The advantage of using the BST is that all major operations (insert, search, and remove) are Θ(logn) in the average case. Of course, if the tree is badly balanced, then the cost can be as bad as Θ(n). Here is an implementation for the Dictionary interface, using a BST to store the records.
A Trie (usually pronounced “try”) is a tree data structure optimized for a specific type of searching. You use a Trie when you want to take a partial value and return a set of possible complete values. The classic example for this is an Autocomplete.
I'd say use a Trie, or better yet use its more space efficient cousin the Directed Acyclic Word Graph (DAWG).
It has the same runtime characteristics (insert, look up, delete) as a Trie but overlaps common suffixes as well as common prefixes which can be a big saving on space.
If this is C++, you should also consider std::tr1::unordered_set
. (If you have C++0x, you can use std::unordered_set
.)
This just uses a hash table internally, which I would wager will out-perform any tree-like structure in practice. It is also trivial to implement because you have nothing to implement.
Binary search is going to be easier to implement and it's only going to involve comparing tens of strings at the most. Given you know the data up front, you can build a balanced binary tree so performance is going to be predictable and easily understood.
With that in mind, I'd use a standard binary tree (probably using set
from C++ since that's typically implemented as a tree).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With