Is there anything better than a Trie for this situation?
I'm working with Java, so my first attempt was to just use a Set<String>. However, I'm targeting a mobile device and already running low on memory. Since many English words share common prefixes, a trie seems like a decent bet to save some memory -- anyone know some other good options?
EDIT - More info - The data structure will be used for two operations
Thanks for the good suggestions
Arrays. An array is a linear data structure that holds an ordered collection of values. It's the most efficient in storing and accessing a sequence of objects.
1. Arrays. An array is a structure of fixed-size, which can hold items of the same data type. It can be an array of integers, an array of floating-point numbers, an array of strings or even an array of arrays (such as 2-dimensional arrays).
Hashing is one simple option for this. We can put all words in a hash table.
Trie Data Structure It provides a way to store strings efficiently and also to search for them in a lot lesser time complexity.
One structure I saw for minimizing space in a spelling dictionary was to encode each word as:
So the word list
HERE would encode as THIS
sanctimonious 0,sanctimonious
sanction 6,on
sanguine 3,guine
trivial 0,trivial
You're saving 7 bytes straight up there (19%), I suspect the saving would be similar for a 20,000 word dictionary just due to the minimum distances between (common prefixes of) adjacent words.
For example, I ran a test program over a sorted dictionary and calculated old storage space as the word length plus one (for terminator), and the new storage space as one for the common length, the uncommon suffix length and one for a terminator. Here's the final part of that test program showing that you could well over 50%:
zwiebacks -> zygote common= old=1044662 new=469762 55.0%
zygote -> zygotes common=zygote old=1044670 new=469765 55.0%
zygotes -> zygotic common=zygot old=1044678 new=469769 55.0%
zygotic -> zymase common=zy old=1044685 new=469775 55.0%
zymase -> zymogenic common=zym old=1044695 new=469783 55.0%
zymogenic -> zymology common=zymo old=1044704 new=469789 55.0%
zymology -> zymolysis common=zymol old=1044714 new=469795 55.0%
zymolysis -> zymoplastic common=zymo old=1044726 new=469804 55.0%
zymoplastic -> zymoscope common=zymo old=1044736 new=469811 55.0%
zymoscope -> zymurgy common=zym old=1044744 new=469817 55.0%
zymurgy -> zyzzyva common=zy old=1044752 new=469824 55.0%
zyzzyva -> zyzzyvas common=zyzzyva old=1044761 new=469827 55.0%
To speed lookup, there was a 26-entry table in memory which held the starting offsets for words beginning with a, b, c, ..., z. The words at these offsets always had 0 as the first byte as they had no letters in common with the previous word.
This seems to be sort of a trie but without the pointers, which would surely get space-expensive if every character in the tree had a 4-byte pointer associated with it.
Mind you, this was from my CP/M days where memory was much scarcer than it is now.
A Patricia trie may be more appropriate:
http://en.wikipedia.org/wiki/Patricia_tree
My (fuzzy) memory tells me there were used in some of the early full-text search engines ...
Paul.
What are you doing? If it's spell checking, you could use a bloom filter - see this code kata.
You still have to maintain the tree structure itself with Trie. Huffman encoding the alphabet or N-letters (for common forms like "tion", "un", "ing") can take advantage of the occurrence frequency in your dictionary and compress the entry to bits.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With