Space-Efficient Data Structure for Storing a Word List?

Tags:

data-structures

Is there anything better than a Trie for this situation?

Storing a list of ~100k English words
Needs to use minimal memory
Lookups need to be reasonable, but don't have to be lightning fast

I'm working with Java, so my first attempt was to just use a Set<String>. However, I'm targeting a mobile device and already running low on memory. Since many English words share common prefixes, a trie seems like a decent bet to save some memory -- anyone know some other good options?

EDIT - More info - The data structure will be used for two operations

Answering: Is some word XYZ in the list?
Generating the neighborhood of words around XYZ with one letter different

Thanks for the good suggestions

551

asked Dec 11 '08 01:12

allclaws

4 Answers

One structure I saw for minimizing space in a spelling dictionary was to encode each word as:

the number of characters (a byte) in common with the last; and
the new ending.

So the word list

HERE            would encode as    THIS
sanctimonious                      0,sanctimonious
sanction                           6,on
sanguine                           3,guine
trivial                            0,trivial

You're saving 7 bytes straight up there (19%), I suspect the saving would be similar for a 20,000 word dictionary just due to the minimum distances between (common prefixes of) adjacent words.

For example, I ran a test program over a sorted dictionary and calculated old storage space as the word length plus one (for terminator), and the new storage space as one for the common length, the uncommon suffix length and one for a terminator. Here's the final part of that test program showing that you could well over 50%:

zwiebacks   -> zygote      common=        old=1044662 new=469762 55.0%
zygote      -> zygotes     common=zygote  old=1044670 new=469765 55.0%
zygotes     -> zygotic     common=zygot   old=1044678 new=469769 55.0%
zygotic     -> zymase      common=zy      old=1044685 new=469775 55.0%
zymase      -> zymogenic   common=zym     old=1044695 new=469783 55.0%
zymogenic   -> zymology    common=zymo    old=1044704 new=469789 55.0%
zymology    -> zymolysis   common=zymol   old=1044714 new=469795 55.0%
zymolysis   -> zymoplastic common=zymo    old=1044726 new=469804 55.0%
zymoplastic -> zymoscope   common=zymo    old=1044736 new=469811 55.0%
zymoscope   -> zymurgy     common=zym     old=1044744 new=469817 55.0%
zymurgy     -> zyzzyva     common=zy      old=1044752 new=469824 55.0%
zyzzyva     -> zyzzyvas    common=zyzzyva old=1044761 new=469827 55.0%

To speed lookup, there was a 26-entry table in memory which held the starting offsets for words beginning with a, b, c, ..., z. The words at these offsets always had 0 as the first byte as they had no letters in common with the previous word.

This seems to be sort of a trie but without the pointers, which would surely get space-expensive if every character in the tree had a 4-byte pointer associated with it.

Mind you, this was from my CP/M days where memory was much scarcer than it is now.

151

answered Nov 08 '22 22:11

paxdiablo

A Patricia trie may be more appropriate:

http://en.wikipedia.org/wiki/Patricia_tree

My (fuzzy) memory tells me there were used in some of the early full-text search engines ...

Paul.

answered Nov 08 '22 22:11

Related questions
                            
                                How to define margin of CardView
                            
                                Type casting in Java from char to short [duplicate]
                            
                                How to print java method call stack?
                            
                                How to pass topics dynamically to a kafka listener?
                            
                                Why does this code using streams run so much faster in Java 9 than Java 8?
                            
                                Outputting elements of an array with Arrays.toString()
                            
                                Why do I get Java.lang.IllegalArgumentException: wrong number of arguments while invoking a method with varargs using Reflection [duplicate]
                            
                                URL must start with 'jdbc' in sprinboot
                            
                                In the context of SpringMVC, how to have web services that provide different JSON representation of a same class?
                            
                                How is String a reference type in Java?
                            
                                BigDecimal summary statistics
                            
                                Why is the size of this set 1 after adding 5 objects?
                            
                                Java 9 ifPresentOrElse returning value
                            
                                How to provide a custom deserializer with Jackson and Spring Boot
                            
                                JUnit testing an asynchronous method with Mockito
                            
                                How to get all objects having the same max value with Java streams? [duplicate]
                            
                                io.jsonwebtoken.UnsupportedJwtException: Signed Claims JWSs are not supported
                            
                                How Would You Programmatically Create a Pattern from a Date that is Stored in a String?
                            
                                Too many "pattern suffixes" - design smell?
                            
                                Java 1.6 SDK on Mac OS 10.5

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Space-Efficient Data Structure for Storing a Word List?

Tags:

java

data-structures

allclaws

People also ask

4 Answers

paxdiablo

Paul W Homer

Mike Scott

Eugene Yokota

Recent Activity

Donate For Us