For context, there are about 200k words in the English dictionary, and the alphabet has 26 letters.
In this article, the author built a trie of English words from a file that was 935,015 bytes long. It required a quarter million nodes. He claims a compression ratio of approximately 73%, which is pretty close to what I remember from my work with such data structures.
Note that his implementation wastes a lot of memory by storing an array of 26 child pointers for every node. A much less expensive implementation would maintain just the pointers it needs, ordered by frequency of use. For example, it's kind of crazy to store 26 child node pointers for the letter `q` in a word, considering that it's highly unlikely that the character after `q` will be anything but `u`.
Searching that short list sequentially would take slightly longer than indexing directly into an array, but it would make for considerable savings in memory. And the memory savings could mean a lot fewer cache misses, which could very well more than make up for the increased cost of the linear search.
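As a rough illustration (this is not the article's code, and the names are just for the example), the two node layouts might look like this in C++, with the sparse node doing a short linear scan over only the children it actually has:

```cpp
#include <utility>
#include <vector>

// Dense layout: 26 child slots per node, used or not.
struct DenseNode {
    bool terminal = false;
    DenseNode* children[26] = {nullptr};   // 26 * 8 bytes on a 64-bit machine
};

// Sparse layout: only the children that actually exist, searched linearly.
// Keeping the list ordered by how often each child is taken means the common
// case ('u' after 'q', say) is usually found on the first comparison.
struct SparseNode {
    bool terminal = false;
    std::vector<std::pair<char, SparseNode*>> children;

    SparseNode* find(char c) {
        for (auto& [letter, child] : children)   // short, cache-friendly scan
            if (letter == c) return child;
        return nullptr;
    }
};

int main() {
    static_assert(sizeof(DenseNode) >= 26 * sizeof(DenseNode*),
                  "dense node carries 26 pointers regardless of fan-out");

    SparseNode u;                           // node reached after "qu"
    SparseNode q;                           // node for the letter 'q'
    q.children.push_back({'u', &u});        // in practice almost its only child
    return q.find('u') == &u ? 0 : 1;       // 0: lookup found the right child
}
```

A sparse node for a typical English-dictionary trie has only a handful of children, so the whole child list often fits in a single cache line.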
If you're interested in saving even more space, you can build a Directed Acyclic Word Graph (DAWG), which takes advantage of common word endings in addition to shared prefixes, along with some other optimizations. For example, dangling endings can be compressed into a single node.
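Here's a minimal sketch of that suffix-sharing idea, assuming a static word list: build an ordinary trie first, then merge structurally identical subtrees bottom-up so common endings collapse into shared nodes. This is not a full incremental DAWG construction, just an illustration of the effect:

```cpp
#include <cstdio>
#include <map>
#include <string>
#include <vector>

struct Node {
    bool terminal = false;           // true if a word ends here
    std::map<char, int> children;    // letter -> node index
};

std::vector<Node> nodes(1);          // nodes[0] is the root

void insert(const std::string& word) {
    int cur = 0;
    for (char c : word) {
        auto it = nodes[cur].children.find(c);
        if (it == nodes[cur].children.end()) {
            nodes.push_back(Node{});
            int next = (int)nodes.size() - 1;
            nodes[cur].children[c] = next;
            cur = next;
        } else {
            cur = it->second;
        }
    }
    nodes[cur].terminal = true;
}

// Two nodes are equivalent if they have the same terminal flag and identical
// (letter, canonical child) edges. Map each signature to one canonical node.
std::map<std::string, int> canonical;

int minimize(int id) {
    std::string sig = nodes[id].terminal ? "1" : "0";
    for (auto& [c, child] : nodes[id].children) {
        child = minimize(child);              // remap edge to the canonical child
        sig += c;
        sig += std::to_string(child) + ",";   // unambiguous for a-z edge letters
    }
    return canonical.emplace(sig, id).first->second;
}

int main() {
    for (const char* w : {"tap", "taps", "top", "tops", "stop", "stops"})
        insert(w);
    std::printf("trie nodes: %zu\n", nodes.size());
    minimize(0);
    std::printf("distinct nodes after suffix merging: %zu\n", canonical.size());
}
```

For a large dictionary there are incremental algorithms that avoid building the full trie first, but even this naive post-pass shows how much the shared endings save.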
Using a simple prefix tree, the space requirement should be O(N*C), where N is the number of words and C is the average number of characters per word, because in the worst case a trie stores every character of every word. With roughly 200k words averaging about five characters each, a fair estimate is around 1 million characters stored, or around 1 MB of character data (before any per-node pointer overhead).
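To make that arithmetic concrete (the 200k-word count and the five-characters-per-word average are assumptions, not measurements), a quick back-of-the-envelope calculation, which also shows why the 26-pointer layout discussed above is so costly:

```cpp
#include <cstdio>

int main() {
    // Back-of-the-envelope only; both figures below are assumptions.
    const long long words    = 200000;            // ~200k dictionary words
    const long long avgChars = 5;                 // assumed average word length
    const long long chars    = words * avgChars;  // worst case: no shared prefixes

    // Character payload alone, at one byte per stored character.
    std::printf("characters stored: ~%lld (~%lld KB)\n", chars, chars / 1024);

    // Overhead of the dense layout: 26 child pointers per node, 8 bytes each.
    std::printf("26-pointer-per-node overhead: ~%lld MB\n",
                chars * 26 * 8 / (1024 * 1024));
    return 0;
}
```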