I'm looking for specific suggestions for, or references to, an algorithm and/or data structure for encoding a list of words into what would effectively be a spell-checking dictionary. The objective of this scheme is a very high compression ratio of the raw word list into the encoded form. The only output requirement I have on the encoded dictionary is that any proposed target word can be tested for existence against the original word list in a relatively efficient manner. For example, the application might want to check 10,000 words against a 100,000-word dictionary. It is not a requirement for the encoded dictionary to be [easily] convertible back into the original word list form - a binary yes/no result is all that is needed for each word tested against the resulting dictionary.
I am assuming that, to improve the compression ratio, the encoding scheme would take advantage of known structures in a given language, such as singular and plural forms, possessive forms, contractions, etc. I am specifically interested in encoding mainly English words, but to be clear, the scheme must be able to encode any and all ASCII text "words".
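To illustrate the kind of affix handling I mean, here is a rough sketch: test the word as-is, then retry with common English suffixes stripped, so the stored list only needs base forms. The dict_contains lookup against the encoded dictionary is hypothetical, and the suffix list is just an example (a real stripper would need rules to avoid over-accepting, e.g. "caring" → "car").

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical primitive: exact-match test against the encoded dictionary. */
bool dict_contains(const char *word);

/* Check a word directly, then retry with common suffixes stripped. */
bool spell_check(const char *word)
{
    static const char *suffixes[] = { "'s", "s", "es", "ed", "ing" };
    char base[64];                      /* assumes words shorter than 64 bytes */
    size_t len = strlen(word);

    if (dict_contains(word))
        return true;

    for (size_t i = 0; i < sizeof suffixes / sizeof suffixes[0]; i++) {
        size_t slen = strlen(suffixes[i]);
        if (len > slen && strcmp(word + len - slen, suffixes[i]) == 0
            && len - slen < sizeof base) {
            memcpy(base, word, len - slen);
            base[len - slen] = '\0';
            if (dict_contains(base))    /* base form present? */
                return true;
        }
    }
    return false;
}
```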
You can assume the particular application I have in mind is for embedded devices, where non-volatile storage space is at a premium and the dictionary would be a randomly accessible, read-only memory area.
EDIT: To sum up the requirements of the dictionary: • very high compression of the raw word list • efficient yes/no membership tests against the encoded form • no need to recover the original word list from the encoded form • suitable for a random-access, read-only memory area on an embedded device.
Finding the best possible model is the real art of data compression. There are three types of models: static, semi-adaptive (semistatic), and adaptive.
The LZW algorithm is a very common dictionary-based compression technique, though note that it compresses a byte stream: testing a word would require decompressing rather than probing the encoded form directly.
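For reference, a minimal sketch of LZW's encoding loop. Codes are capped at 12 bits and simply printed; a real implementation would pack them into a bit stream, and would use a hash table or trie instead of the linear search here.

```c
#include <stdint.h>
#include <stdio.h>

#define MAX_CODES 4096                  /* 12-bit codes */

/* table[i] = (prefix code, appended byte) for learned codes i >= 256 */
static struct { int16_t prefix; uint8_t ch; } table[MAX_CODES];
static int ncodes;

/* Linear search for the code representing sequence (prefix, ch); a
 * sketch only - real encoders index this lookup. */
static int find(int prefix, uint8_t ch)
{
    for (int i = 256; i < ncodes; i++)
        if (table[i].prefix == prefix && table[i].ch == ch)
            return i;
    return -1;
}

void lzw_encode(const uint8_t *in, size_t len)
{
    if (len == 0)
        return;
    ncodes = 256;                       /* codes 0..255 are single bytes */
    int cur = in[0];
    for (size_t i = 1; i < len; i++) {
        int next = find(cur, in[i]);
        if (next >= 0) { cur = next; continue; }
        printf("%d ", cur);             /* emit longest known sequence */
        if (ncodes < MAX_CODES) {       /* learn sequence cur + in[i] */
            table[ncodes].prefix = (int16_t)cur;
            table[ncodes].ch = in[i];
            ncodes++;
        }
        cur = in[i];
    }
    printf("%d\n", cur);
}
```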
The fastest algorithm, lz4, yields lower compression ratios; xz, which has the highest compression ratio, suffers from slow compression speed. Zstandard, at its default setting, however, shows substantial improvements in both compression speed and decompression speed, while compressing at the same ratio as zlib.
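As a sketch of using such a codec here - keeping in mind that a general-purpose codec compresses the whole list, so testing a word means decompressing first, unlike the lookup structures below - this is one-shot compression with libzstd. The API calls are real; the tiny inline word list is just an illustration.

```c
#include <stdio.h>
#include <stdlib.h>
#include <zstd.h>                       /* link with -lzstd */

/* Compress an in-memory word list with Zstandard's one-shot API.
 * Returns the compressed size, or 0 on error. */
size_t compress_wordlist(const void *src, size_t src_size,
                         void *dst, size_t dst_capacity, int level)
{
    size_t n = ZSTD_compress(dst, dst_capacity, src, src_size, level);
    return ZSTD_isError(n) ? 0 : n;
}

int main(void)
{
    const char words[] = "cat\ncats\ndog\ndogs\n";
    size_t cap = ZSTD_compressBound(sizeof words);
    void *buf = malloc(cap);
    if (!buf)
        return 1;
    size_t n = compress_wordlist(words, sizeof words, buf, cap, 19);
    printf("%zu -> %zu bytes\n", sizeof words, n);
    free(buf);
    return 0;
}
```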
See McIlroy's "Development of a Spelling List" on his pubs page - a classic old paper on spellchecking on a minicomputer, whose constraints map surprisingly well onto the ones you listed. It gives a detailed analysis of affix stripping and of two different compression methods: Bloom filters and a related scheme that Huffman-codes a sparse bitset. I'd probably go with Bloom filters in preference to the method he picked, which squeezes a few more kB out at a significant cost in speed. (Programming Pearls has a short chapter about this paper.)
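A minimal sketch of the Bloom filter approach. The bit-array size, hash count, and seeded FNV-1a hash below are my own illustrative choices, not McIlroy's; the bit array would be built offline and burned into ROM.

```c
#include <stdbool.h>
#include <stdint.h>

#define BLOOM_BITS   (1u << 20)         /* 1 Mbit = 128 KiB; tune to taste */
#define BLOOM_HASHES 7

static uint8_t bloom[BLOOM_BITS / 8];   /* lives in read-only memory once built */

/* Seeded FNV-1a, used with k different seeds to get k hash values. */
static uint32_t hash(const char *word, uint32_t seed)
{
    uint32_t h = 2166136261u ^ seed;
    while (*word) {
        h ^= (uint8_t)*word++;
        h *= 16777619u;
    }
    return h % BLOOM_BITS;
}

/* Build step: set k bits per word. */
void bloom_add(const char *word)
{
    for (uint32_t i = 0; i < BLOOM_HASHES; i++) {
        uint32_t bit = hash(word, i);
        bloom[bit / 8] |= (uint8_t)(1u << (bit % 8));
    }
}

/* false => definitely absent; true => present, with a small false-positive rate. */
bool bloom_query(const char *word)
{
    for (uint32_t i = 0; i < BLOOM_HASHES; i++) {
        uint32_t bit = hash(word, i);
        if (!(bloom[bit / 8] & (1u << (bit % 8))))
            return false;
    }
    return true;
}
```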
See also the methods used to store the lexicon in full-text search systems, e.g. Introduction to Information Retrieval. Unlike the methods above, these have no false positives.
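One such lexicon technique from that book is blocked front coding: store the words sorted, keep each block's first word in full so an index over first words can binary-search to the right block, and encode every later word as its shared-prefix length plus the remaining suffix. A rough decode-and-scan sketch for one block (the byte-level entry layout is my own):

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* One front-coded block: each entry is (shared-prefix length, suffix
 * length, suffix bytes); the block's first entry has prefix length 0,
 * i.e. it is stored in full. */
bool block_contains(const uint8_t *p, const uint8_t *end, const char *target)
{
    char cur[64];                       /* assumes words shorter than 64 bytes */

    while (p < end) {
        size_t shared = *p++;           /* bytes reused from previous word */
        size_t suffix = *p++;
        memcpy(cur + shared, p, suffix);
        p += suffix;
        cur[shared + suffix] = '\0';

        int cmp = strcmp(cur, target);
        if (cmp == 0) return true;
        if (cmp > 0)  return false;     /* sorted: passed where target would be */
    }
    return false;
}
```

For example, the sorted words cat, catalog, cats would encode as (0,3,"cat"), (3,4,"alog"), (3,1,"s").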
A Bloom filter (http://en.wikipedia.org/wiki/Bloom_filter and http://www.coolsnap.net/kevin/?p=13) is a data structure used in some spell checkers to store the dictionary words very compactly. There is, however, a risk of false positives.
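The false-positive rate is tunable: for n words and a target rate p, the standard sizing formulas give m = -n ln(p) / (ln 2)^2 bits and k = (m/n) ln 2 hash functions. A quick computation for your 100,000-word example (just the formulas evaluated):

```c
#include <math.h>
#include <stdio.h>

/* Optimal Bloom filter size m (bits) and hash count k for n items and
 * target false-positive rate p. Link with -lm. */
int main(void)
{
    double n = 100000.0, p = 0.01;
    double m = ceil(-n * log(p) / (log(2) * log(2)));
    int    k = (int)round(m / n * log(2));
    printf("m = %.0f bits (%.0f KiB), k = %d\n", m, m / 8 / 1024, k);
    /* prints roughly: m = 958506 bits (117 KiB), k = 7 */
    return 0;
}
```

So roughly 117 KiB buys a 1% false-positive rate over 100,000 words - a little over one byte per word, regardless of word length.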