Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What's the size of a prefix tree (trie) that contains all the english words?

Knowing that there are about 200k words in the english dictionary and the alphabet has 26 letters or so.

like image 719
Raul Avatar asked Mar 04 '14 21:03

Raul


2 Answers

In this article, the author built a trie of English words from a file that was 935,015 bytes long. It required a quarter million nodes. He claims a compression ratio of approximately 73%, which is pretty close to what I remember from my work with such data structures.

Note that his implementation wastes a lot of memory by storing an array of 26 child pointers for every node. A much less expensive implementation would maintain just the pointers it needs, ordered by frequency of use. For example, it's kind of crazy to store 26 child node pointers for the letter q in a word, considering that it's highly unlikely that the character after q will be anything but u.

Sequentially searching that would take slightly longer than directly indexing an array, but it would make for considerable savings in memory. And the memory savings could result in a lot fewer cache misses, which could very well more than make up for the increased cost of linear search.

If you're interested in saving even more space, you can create a Directed Acyclic Word Graph, which also takes advantage of common endings as well as some other optimizations. For example, you can compress dangling endings into a single node.

like image 58
Jim Mischel Avatar answered Oct 16 '22 13:10

Jim Mischel


Using a simple prefix tree, the space requirement should be O(N*C) where C is the average number of characters per word and N is the number of words. This is because in the worst case, a Trie will store every character in each word. So a fair estimate would be around 1 million characters stored, or around 1 MB.

like image 25
Nuclearman Avatar answered Oct 16 '22 12:10

Nuclearman