Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Space-Efficient Data Structure for Storing a Word List?

Is there anything better than a Trie for this situation?

  • Storing a list of ~100k English words
  • Needs to use minimal memory
  • Lookups need to be reasonable, but don't have to be lightning fast

I'm working with Java, so my first attempt was to just use a Set<String>. However, I'm targeting a mobile device and already running low on memory. Since many English words share common prefixes, a trie seems like a decent bet to save some memory -- anyone know some other good options?

EDIT - More info - The data structure will be used for two operations

  • Answering: Is some word XYZ in the list?
  • Generating the neighborhood of words around XYZ with one letter different

Thanks for the good suggestions

like image 551
allclaws Avatar asked Dec 11 '08 01:12

allclaws


People also ask

Which data structure is most efficient for in memory?

Arrays. An array is a linear data structure that holds an ordered collection of values. It's the most efficient in storing and accessing a sequence of objects.

What is the best data structure type to store a list of numbers?

1. Arrays. An array is a structure of fixed-size, which can hold items of the same data type. It can be an array of integers, an array of floating-point numbers, an array of strings or even an array of arrays (such as 2-dimensional arrays).

Which is an ideal data structure to store dictionary?

Hashing is one simple option for this. We can put all words in a hash table.

Which data structure for a dictionary full Word Search offers comparatively less time complexity?

Trie Data Structure It provides a way to store strings efficiently and also to search for them in a lot lesser time complexity.


4 Answers

One structure I saw for minimizing space in a spelling dictionary was to encode each word as:

  • the number of characters (a byte) in common with the last; and
  • the new ending.

So the word list

HERE            would encode as    THIS
sanctimonious                      0,sanctimonious
sanction                           6,on
sanguine                           3,guine
trivial                            0,trivial

You're saving 7 bytes straight up there (19%), I suspect the saving would be similar for a 20,000 word dictionary just due to the minimum distances between (common prefixes of) adjacent words.

For example, I ran a test program over a sorted dictionary and calculated old storage space as the word length plus one (for terminator), and the new storage space as one for the common length, the uncommon suffix length and one for a terminator. Here's the final part of that test program showing that you could well over 50%:

zwiebacks   -> zygote      common=        old=1044662 new=469762 55.0%
zygote      -> zygotes     common=zygote  old=1044670 new=469765 55.0%
zygotes     -> zygotic     common=zygot   old=1044678 new=469769 55.0%
zygotic     -> zymase      common=zy      old=1044685 new=469775 55.0%
zymase      -> zymogenic   common=zym     old=1044695 new=469783 55.0%
zymogenic   -> zymology    common=zymo    old=1044704 new=469789 55.0%
zymology    -> zymolysis   common=zymol   old=1044714 new=469795 55.0%
zymolysis   -> zymoplastic common=zymo    old=1044726 new=469804 55.0%
zymoplastic -> zymoscope   common=zymo    old=1044736 new=469811 55.0%
zymoscope   -> zymurgy     common=zym     old=1044744 new=469817 55.0%
zymurgy     -> zyzzyva     common=zy      old=1044752 new=469824 55.0%
zyzzyva     -> zyzzyvas    common=zyzzyva old=1044761 new=469827 55.0%

To speed lookup, there was a 26-entry table in memory which held the starting offsets for words beginning with a, b, c, ..., z. The words at these offsets always had 0 as the first byte as they had no letters in common with the previous word.

This seems to be sort of a trie but without the pointers, which would surely get space-expensive if every character in the tree had a 4-byte pointer associated with it.

Mind you, this was from my CP/M days where memory was much scarcer than it is now.

like image 151
paxdiablo Avatar answered Nov 08 '22 22:11

paxdiablo


A Patricia trie may be more appropriate:

http://en.wikipedia.org/wiki/Patricia_tree

My (fuzzy) memory tells me there were used in some of the early full-text search engines ...

Paul.

like image 31
Paul W Homer Avatar answered Nov 08 '22 22:11

Paul W Homer


What are you doing? If it's spell checking, you could use a bloom filter - see this code kata.

like image 40
Mike Scott Avatar answered Nov 08 '22 20:11

Mike Scott


You still have to maintain the tree structure itself with Trie. Huffman encoding the alphabet or N-letters (for common forms like "tion", "un", "ing") can take advantage of the occurrence frequency in your dictionary and compress the entry to bits.

like image 44
Eugene Yokota Avatar answered Nov 08 '22 22:11

Eugene Yokota