 

Compression Algorithm for Encoding Word Lists

I am looking for specific suggestions or references to an algorithm and/or data structure for encoding a list of words into what would effectively be a spell-checking dictionary. The objective of this scheme is a very high compression ratio of the raw word list into the encoded form. The only output requirement I have on the encoded dictionary is that any proposed target word can be tested for existence against the original word list relatively efficiently. For example, the application might want to check 10,000 words against a 100,000-word dictionary. It is not a requirement for the encoded dictionary to be [easily] convertible back into the original word list form - a binary yes/no result is all that is needed for each word tested against the resulting dictionary.

I am assuming that, to improve the compression ratio, the encoding scheme would take advantage of known structures in a given language, such as singular and plural forms, possessive forms, contractions, etc. I am specifically interested in encoding mainly English words but, to be clear, the scheme must be able to encode any and all ASCII text "words".

You can assume the particular application I have in mind is for embedded devices, where non-volatile storage space is at a premium and the dictionary would be a randomly accessible read-only memory area.

EDIT: To sum up the requirements of the dictionary:

  • zero false positives
  • zero false negatives
  • very high compression ratio
  • no need for decompression
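
To make the lookup requirement concrete, here is a minimal, uncompressed baseline showing the kind of yes/no query the encoded form has to support (a plain binary search over a sorted word table; the names and table contents are purely illustrative, not the scheme I'm after):

    /* Uncompressed baseline sketch, NOT the compression scheme itself:
     * a sorted word table in read-only memory, queried with binary search.
     * Any encoded format would need to answer the same yes/no question. */
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdio.h>
    #include <string.h>

    /* In the real application this table would live in non-volatile,
     * randomly accessible read-only memory. */
    static const char *const dict_words[] = {
        "apple", "apples", "banana", "cat", "cats", "dog"
    };
    static const size_t dict_count = sizeof dict_words / sizeof dict_words[0];

    static bool dict_contains(const char *word)
    {
        size_t lo = 0, hi = dict_count;
        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;
            int cmp = strcmp(word, dict_words[mid]);
            if (cmp == 0) return true;
            if (cmp < 0)  hi = mid;
            else          lo = mid + 1;
        }
        return false;
    }

    int main(void)
    {
        printf("%d %d\n", dict_contains("cats"), dict_contains("cart"));
        return 0;
    }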
Asked Jan 01 '09 by Tall Jeff



2 Answers

See McIlroy's "Development of a Spelling List" at his pubs page - a classic old paper on spellchecking on a minicomputer, whose constraints map surprisingly well onto the ones you listed. It gives a detailed analysis of affix stripping and of two different compression methods: Bloom filters and a related scheme that Huffman-codes a sparse bitset. I'd probably go with Bloom filters in preference to the method he picked, which squeezes a few more kB out at significant cost in speed. (Programming Pearls has a short chapter about this paper.)
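
If you take the Bloom-filter route, the query side is small enough for an embedded target. Here is a minimal sketch; the sizes, hash function, and names are illustrative choices of mine, not McIlroy's exact parameters, and the bit array would be built offline and burned into read-only memory:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define FILTER_BITS  (1u << 20)   /* illustrative size: 128 KB of ROM */
    #define NUM_HASHES   7            /* illustrative number of hash functions */

    /* On the real device this array would be a precomputed ROM image. */
    static uint8_t filter[FILTER_BITS / 8];

    /* FNV-1a with a per-hash seed; any decent string hash works here. */
    static uint32_t hash_word(const char *word, uint32_t seed)
    {
        uint32_t h = 2166136261u ^ seed;
        while (*word) {
            h ^= (uint8_t)*word++;
            h *= 16777619u;
        }
        return h;
    }

    static void filter_add(const char *word)            /* offline build step */
    {
        for (uint32_t i = 0; i < NUM_HASHES; i++) {
            uint32_t bit = hash_word(word, i) % FILTER_BITS;
            filter[bit / 8] |= (uint8_t)(1u << (bit % 8));
        }
    }

    static bool filter_maybe_contains(const char *word) /* on-device query */
    {
        for (uint32_t i = 0; i < NUM_HASHES; i++) {
            uint32_t bit = hash_word(word, i) % FILTER_BITS;
            if (!(filter[bit / 8] & (1u << (bit % 8))))
                return false;          /* definitely not in the list */
        }
        return true;                   /* probably in the list */
    }

    int main(void)
    {
        filter_add("apple");
        filter_add("banana");
        printf("%d %d\n", filter_maybe_contains("apple"),
                          filter_maybe_contains("pear"));
        return 0;
    }

The filter size and hash count would be tuned offline for your word count and the false-positive rate you can tolerate.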

See also the methods used to store the lexicon in full-text search systems, e.g. Introduction to Information Retrieval. Unlike the methods above, these have no false positives.
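
One of the lexicon-compression tricks described in that literature is front coding: store the sorted words as one long string with shared prefixes elided, plus a small offset table so you can binary-search on block head words. A rough sketch of the idea follows; the block format and all the names are my own illustration rather than anything from the book, and the build step would normally run offline:

    /* Front-coded lexicon sketch with exact (no false positive) lookups.
     * Block layout (illustrative):
     *   [full head word '\0'], then for each following word in the block:
     *   [1 byte shared-prefix length] [suffix '\0'] */
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdio.h>
    #include <string.h>

    #define BLOCK 4   /* words per block; illustrative */

    static const char *const words[] = {   /* must be sorted */
        "cat", "catalog", "cats", "dog", "dogged", "dogs", "door", "doors"
    };
    enum { NWORDS = sizeof words / sizeof words[0] };

    static char blob[256];
    static size_t block_off[(NWORDS + BLOCK - 1) / BLOCK];
    static size_t nblocks, blob_len;

    static size_t common_prefix(const char *a, const char *b)
    {
        size_t n = 0;
        while (a[n] && a[n] == b[n]) n++;
        return n > 255 ? 255 : n;
    }

    static void build(void)   /* done offline in the real application */
    {
        size_t pos = 0;
        for (size_t i = 0; i < NWORDS; i++) {
            if (i % BLOCK == 0) {
                block_off[nblocks++] = pos;
                pos += 1 + sprintf(blob + pos, "%s", words[i]);       /* head word */
            } else {
                size_t pre = common_prefix(words[i - 1], words[i]);
                blob[pos++] = (char)pre;                              /* prefix length */
                pos += 1 + sprintf(blob + pos, "%s", words[i] + pre); /* suffix */
            }
        }
        blob_len = pos;
    }

    static bool contains(const char *word)
    {
        /* Binary search on the head word of each block. */
        size_t lo = 0, hi = nblocks;
        while (hi - lo > 1) {
            size_t mid = lo + (hi - lo) / 2;
            if (strcmp(word, blob + block_off[mid]) < 0) hi = mid;
            else lo = mid;
        }
        /* Linear scan inside the chosen block, rebuilding each word. */
        char cur[64];   /* sized for this toy example */
        const char *p = blob + block_off[lo];
        strcpy(cur, p);
        p += strlen(p) + 1;
        size_t end = (lo + 1 < nblocks) ? block_off[lo + 1] : blob_len;
        for (;;) {
            if (strcmp(cur, word) == 0) return true;
            if ((size_t)(p - blob) >= end) return false;
            size_t pre = (unsigned char)*p++;
            strcpy(cur + pre, p);
            p += strlen(p) + 1;
        }
    }

    int main(void)
    {
        build();
        printf("%d %d %d\n", contains("dogs"), contains("dot"), contains("cat"));
        return 0;
    }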

Answered Sep 28 '22 by Darius Bacon


A Bloom filter (http://en.wikipedia.org/wiki/Bloom_filter and http://www.coolsnap.net/kevin/?p=13) is a data structure used in some spell checkers to store the dictionary words very compactly. There is, however, a risk of false positives.
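
To put that risk in numbers: with m bits, n words, and the optimal hash count k = (m/n) * ln 2, the false-positive rate is roughly (1 - e^(-kn/m))^k, so on the order of 10 bits per word buys you about a 1% false-positive rate. A quick sizing helper, my own sketch and nothing specific to any library:

    /* Rough Bloom filter sizing: given a word count and a target
     * false-positive rate, report total bits, bits per word, and hash count. */
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double n = 100000.0;   /* dictionary words, as in the question */
        double p = 0.01;       /* target false-positive rate */

        double m = -n * log(p) / (log(2.0) * log(2.0)); /* total bits needed */
        double k = (m / n) * log(2.0);                  /* optimal hash count */

        printf("bits total: %.0f (%.1f KB), bits/word: %.2f, hashes: %.1f\n",
               m, m / 8.0 / 1024.0, m / n, k);
        return 0;
    }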

Answered Sep 28 '22 by ahy1