I'm looking for specific suggestions for, or references to, an algorithm and/or data structure for encoding a list of words into what would effectively be a spell-checking dictionary. The objective of this scheme is a very high compression ratio of the raw word list into the encoded form. The only output requirement I have on the encoded dictionary is that any proposed target word can be tested for existence against the original word list in a relatively efficient manner. For example, the application might want to check 10,000 words against a 100,000-word dictionary. It is not a requirement for the encoded dictionary to be [easily] convertible back into the original word list form - a binary yes/no result is all that is needed for each word tested against the resulting dictionary.
I am assuming that, to improve the compression ratio, the encoding scheme would take advantage of known structures in a given language, such as singular and plural forms, possessive forms, contractions, etc. I am specifically interested in encoding mainly English words, but to be clear, the scheme must be able to encode any and all ASCII text "words".
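To illustrate the kind of affix handling I mean, here is a rough sketch: test the word as-is, then retry with common English suffixes stripped, so the stored list only needs base forms. The dict_contains lookup against the encoded dictionary is hypothetical, and the suffix list is just an example (a real stripper would need rules to avoid over-accepting, e.g. "caring" → "car").

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical primitive: exact-match test against the encoded dictionary. */
bool dict_contains(const char *word);

/* Check a word directly, then retry with common suffixes stripped. */
bool spell_check(const char *word)
{
    static const char *suffixes[] = { "'s", "s", "es", "ed", "ing" };
    char base[64];                      /* assumes words shorter than 64 bytes */
    size_t len = strlen(word);

    if (dict_contains(word))
        return true;

    for (size_t i = 0; i < sizeof suffixes / sizeof suffixes[0]; i++) {
        size_t slen = strlen(suffixes[i]);
        if (len > slen && strcmp(word + len - slen, suffixes[i]) == 0
            && len - slen < sizeof base) {
            memcpy(base, word, len - slen);
            base[len - slen] = '\0';
            if (dict_contains(base))    /* base form present? */
                return true;
        }
    }
    return false;
}
```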
You can assume the particular application I have in mind is for embedded devices, where non-volatile storage space is at a premium and the dictionary would be a randomly accessible, read-only memory area.
EDIT: To sum up the requirements of the dictionary: • very high compression of the raw word list • efficient yes/no membership tests against the encoded form • no need to recover the original word list from the encoded form • suitable for a random-access, read-only memory area on an embedded device.
Finding the best possible model is the real art of data compression. There are three types of models: static, semi-adaptive (semistatic), and adaptive.
The LZW algorithm is a very common dictionary-based compression technique, though note that it compresses a byte stream: testing a word would require decompressing rather than probing the encoded form directly.
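For reference, a minimal sketch of LZW's encoding loop. Codes are capped at 12 bits and simply printed; a real implementation would pack them into a bit stream, and would use a hash table or trie instead of the linear search here.

```c
#include <stdint.h>
#include <stdio.h>

#define MAX_CODES 4096                  /* 12-bit codes */

/* table[i] = (prefix code, appended byte) for learned codes i >= 256 */
static struct { int16_t prefix; uint8_t ch; } table[MAX_CODES];
static int ncodes;

/* Linear search for the code representing sequence (prefix, ch); a
 * sketch only - real encoders index this lookup. */
static int find(int prefix, uint8_t ch)
{
    for (int i = 256; i < ncodes; i++)
        if (table[i].prefix == prefix && table[i].ch == ch)
            return i;
    return -1;
}

void lzw_encode(const uint8_t *in, size_t len)
{
    if (len == 0)
        return;
    ncodes = 256;                       /* codes 0..255 are single bytes */
    int cur = in[0];
    for (size_t i = 1; i < len; i++) {
        int next = find(cur, in[i]);
        if (next >= 0) { cur = next; continue; }
        printf("%d ", cur);             /* emit longest known sequence */
        if (ncodes < MAX_CODES) {       /* learn sequence cur + in[i] */
            table[ncodes].prefix = (int16_t)cur;
            table[ncodes].ch = in[i];
            ncodes++;
        }
        cur = in[i];
    }
    printf("%d\n", cur);
}
```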
The fastest algorithm, lz4, yields lower compression ratios; xz, which has the highest compression ratio, suffers from slow compression speed. Zstandard, at its default setting, however, shows substantial improvements in both compression speed and decompression speed, while compressing at the same ratio as zlib.
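As a sketch of using such a codec here - keeping in mind that a general-purpose codec compresses the whole list, so testing a word means decompressing first, unlike the lookup structures below - this is one-shot compression with libzstd. The API calls are real; the tiny inline word list is just an illustration.

```c
#include <stdio.h>
#include <stdlib.h>
#include <zstd.h>                       /* link with -lzstd */

/* Compress an in-memory word list with Zstandard's one-shot API.
 * Returns the compressed size, or 0 on error. */
size_t compress_wordlist(const void *src, size_t src_size,
                         void *dst, size_t dst_capacity, int level)
{
    size_t n = ZSTD_compress(dst, dst_capacity, src, src_size, level);
    return ZSTD_isError(n) ? 0 : n;
}

int main(void)
{
    const char words[] = "cat\ncats\ndog\ndogs\n";
    size_t cap = ZSTD_compressBound(sizeof words);
    void *buf = malloc(cap);
    if (!buf)
        return 1;
    size_t n = compress_wordlist(words, sizeof words, buf, cap, 19);
    printf("%zu -> %zu bytes\n", sizeof words, n);
    free(buf);
    return 0;
}
```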
See McIlroy's "Development of a Spelling List" on his pubs page - a classic old paper on spellchecking on a minicomputer, whose constraints map surprisingly well onto the ones you listed. It gives a detailed analysis of affix stripping and of two different compression methods: Bloom filters and a related scheme that Huffman-codes a sparse bitset. I'd probably go with Bloom filters in preference to the method he picked, which squeezes a few more kB out at a significant cost in speed. (Programming Pearls has a short chapter about this paper.)
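A minimal sketch of the Bloom filter approach. The bit-array size, hash count, and seeded FNV-1a hash below are my own illustrative choices, not McIlroy's; the bit array would be built offline and burned into ROM.

```c
#include <stdbool.h>
#include <stdint.h>

#define BLOOM_BITS   (1u << 20)         /* 1 Mbit = 128 KiB; tune to taste */
#define BLOOM_HASHES 7

static uint8_t bloom[BLOOM_BITS / 8];   /* lives in read-only memory once built */

/* Seeded FNV-1a, used with k different seeds to get k hash values. */
static uint32_t hash(const char *word, uint32_t seed)
{
    uint32_t h = 2166136261u ^ seed;
    while (*word) {
        h ^= (uint8_t)*word++;
        h *= 16777619u;
    }
    return h % BLOOM_BITS;
}

/* Build step: set k bits per word. */
void bloom_add(const char *word)
{
    for (uint32_t i = 0; i < BLOOM_HASHES; i++) {
        uint32_t bit = hash(word, i);
        bloom[bit / 8] |= (uint8_t)(1u << (bit % 8));
    }
}

/* false => definitely absent; true => present, with a small false-positive rate. */
bool bloom_query(const char *word)
{
    for (uint32_t i = 0; i < BLOOM_HASHES; i++) {
        uint32_t bit = hash(word, i);
        if (!(bloom[bit / 8] & (1u << (bit % 8))))
            return false;
    }
    return true;
}
```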
See also the methods used to store the lexicon in full-text search systems, e.g. Introduction to Information Retrieval. Unlike the methods above, these have no false positives.
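One such lexicon technique from that book is blocked front coding: store the words sorted, keep each block's first word in full so an index over first words can binary-search to the right block, and encode every later word as its shared-prefix length plus the remaining suffix. A rough decode-and-scan sketch for one block (the byte-level entry layout is my own):

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* One front-coded block: each entry is (shared-prefix length, suffix
 * length, suffix bytes); the block's first entry has prefix length 0,
 * i.e. it is stored in full. */
bool block_contains(const uint8_t *p, const uint8_t *end, const char *target)
{
    char cur[64];                       /* assumes words shorter than 64 bytes */

    while (p < end) {
        size_t shared = *p++;           /* bytes reused from previous word */
        size_t suffix = *p++;
        memcpy(cur + shared, p, suffix);
        p += suffix;
        cur[shared + suffix] = '\0';

        int cmp = strcmp(cur, target);
        if (cmp == 0) return true;
        if (cmp > 0)  return false;     /* sorted: passed where target would be */
    }
    return false;
}
```

For example, the sorted words cat, catalog, cats would encode as (0,3,"cat"), (3,4,"alog"), (3,1,"s").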
A Bloom filter (http://en.wikipedia.org/wiki/Bloom_filter and http://www.coolsnap.net/kevin/?p=13) is a data structure used in some spell checkers to store the dictionary words very compactly. There is, however, a risk of false positives.
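The false-positive rate is tunable: for n words and a target rate p, the standard sizing formulas give m = -n ln(p) / (ln 2)^2 bits and k = (m/n) ln 2 hash functions. A quick computation for your 100,000-word example (just the formulas evaluated):

```c
#include <math.h>
#include <stdio.h>

/* Optimal Bloom filter size m (bits) and hash count k for n items and
 * target false-positive rate p. Link with -lm. */
int main(void)
{
    double n = 100000.0, p = 0.01;
    double m = ceil(-n * log(p) / (log(2) * log(2)));
    int    k = (int)round(m / n * log(2));
    printf("m = %.0f bits (%.0f KiB), k = %d\n", m, m / 8 / 1024, k);
    /* prints roughly: m = 958506 bits (117 KiB), k = 7 */
    return 0;
}
```

So roughly 117 KiB buys a 1% false-positive rate over 100,000 words - a little over one byte per word, regardless of word length.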