In honor of the Hutter Prize, what are the top algorithms (and a quick description of each) for text compression? Note: The intent of this question is to get a description of compression algorithms, not of compression programs.

What is the current state of text-only compression algorithms?

2 Answers

The boundary-pushing compressors combine algorithms for insane results. Common algorithms include:

The Burrows-Wheeler Transform (BWT) - permutes the characters (or other blocks) of the input with a deterministic, reversible algorithm so that similar characters end up clustered into runs, which makes the source easier to compress. Decompression runs as normal, and the result is then un-shuffled with the inverse transform. Note: BWT alone doesn't actually compress anything; it just makes the source easier to compress.
Prediction by Partial Matching (PPM) - an adaptive statistical model, usually paired with arithmetic coding, in which the prediction model (context) is built by gathering statistics about the source as it is read instead of using static probabilities. Even though it is most often paired with an arithmetic coder, its predictions can also feed Huffman coding or a dictionary scheme.
Context Mixing - plain arithmetic coding typically works from a single fixed model, and PPM dynamically chooses a single context; context mixing consults many contexts at once and weights their predictions. PAQ uses context mixing.
Dynamic Markov Compression (DMC) - related to PPM, but predicts one bit at a time instead of a byte or longer symbol.
In addition, the Hutter Prize contestants may replace common text with short byte codes from external dictionaries, and mark upper-case text with a special symbol instead of keeping separate upper- and lower-case entries. That's why they're so good at compressing text (especially ASCII text) and not as valuable for general compression.
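
To make the BWT item above concrete, here is a minimal sketch in Python. It is the naive textbook version (sort every rotation, keep the last column), not what a real compressor such as bzip2 does internally, and the function names and the `\0` end-of-string sentinel are assumptions made for this illustration.

```python
def bwt(text: str, eos: str = "\0") -> str:
    """Naive Burrows-Wheeler transform: sort all rotations, keep the last column."""
    # Assumption for this sketch: the sentinel never occurs in the input.
    assert eos not in text, "sentinel must not occur in the input"
    s = text + eos
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last_column: str, eos: str = "\0") -> str:
    """Undo the transform by repeatedly prepending the last column and re-sorting."""
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        table = sorted(ch + row for ch, row in zip(last_column, table))
    # The row that ends with the sentinel is the original string.
    original = next(row for row in table if row.endswith(eos))
    return original[:-1]


if __name__ == "__main__":
    shuffled = bwt("banana")
    print(repr(shuffled))         # identical letters end up next to each other
    print(inverse_bwt(shuffled))  # the original text comes back
```

The output has the same length as the input plus the sentinel, which matches the note that BWT by itself compresses nothing. The gain comes from the follow-up stage, which sees longer runs of identical characters.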

Maximum Compression is a pretty cool text and general compression benchmark site. Matt Mahoney publishes another benchmark. Mahoney's may be of particular interest because it lists the primary algorithm used per entry.
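
The case-marking trick mentioned for the Hutter Prize entries can be sketched roughly as follows. This is only an illustration of the idea, not code from any actual entry: the `CAP_FLAG` byte, the function names, and the restriction to simple ASCII words with a single leading capital are all assumptions.

```python
import re

# Hypothetical marker; assumed never to appear in the input text.
CAP_FLAG = "\x01"


def fold_case(text: str) -> str:
    """Rewrite 'Word' as FLAG + 'word' so both spellings share one dictionary entry."""
    assert CAP_FLAG not in text, "marker must not occur in the input"
    # Only words made of one capital followed by lower-case letters are touched;
    # anything else (e.g. 'NASA', 'iPhone') is left as-is so the step stays reversible.
    return re.sub(
        r"\b([A-Z])([a-z]*)\b",
        lambda m: CAP_FLAG + m.group(1).lower() + m.group(2),
        text,
    )


def unfold_case(text: str) -> str:
    """Reverse fold_case: upper-case the letter that follows each marker."""
    return re.sub(CAP_FLAG + r"([a-z])", lambda m: m.group(1).upper(), text)


if __name__ == "__main__":
    sample = "The cat saw The Hague"
    folded = fold_case(sample)
    print(repr(folded))
    print(unfold_case(folded) == sample)
```

After folding, "The" and "the" are the same token to whatever model or dictionary comes next, at the cost of one extra symbol per capitalized word.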

answered Sep 17 '22 12:09

Corbin March

There's always lzip.

All kidding aside:

Where compatibility is a concern, PKZIP (DEFLATE algorithm) still wins.
bzip2 is the best compromise between a relatively broad install base and a rather good compression ratio, but requires a separate archiver.
7-Zip (LZMA algorithm) compresses very well and is available under the LGPL. Few operating systems ship with built-in support, however.
rzip, which adds a long-range redundancy pass in front of bzip2, in my opinion deserves more attention. It could be particularly interesting for huge log files that need long-term archiving. It also requires a separate archiver.
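
For a rough feel of how the three algorithm families above compare on your own text, here is a small sketch using only Python's standard library. Note the assumptions: `zlib` wraps DEFLATE in a zlib container (not a PKZIP archive), `lzma` writes the .xz format (not .7z), and the helper name `compressed_sizes` is made up for this example, so the numbers only approximate what the tools themselves would produce.

```python
import bz2
import lzma
import zlib


def compressed_sizes(data: bytes) -> dict:
    """Return the compressed size in bytes for each stdlib codec at its highest setting."""
    return {
        "deflate (zlib)": len(zlib.compress(data, 9)),
        "bzip2": len(bz2.compress(data, 9)),
        "lzma (xz)": len(lzma.compress(data, preset=9)),
    }


if __name__ == "__main__":
    # Highly repetitive sample text; real-world data will behave differently.
    sample = b"the quick brown fox jumps over the lazy dog\n" * 500
    print("original:", len(sample))
    for name, size in compressed_sizes(sample).items():
        print(f"{name}: {size}")
```

Results depend heavily on the input, so it is worth running this against a representative file rather than trusting any single ranking.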

answered Sep 20 '22 12:09

Sören Kuklau

Tags: compression, lossless-compression, text-compression

Brian R. Bondy
