
How to efficiently predict if data is compressible

I want to write a storage backend to store larger chunks of data. The data can be anything, but it is mainly binary files (images, pdfs, jar files) or text files (xml, jsp, js, html, java...). I found most of the data is already compressed. If everything is compressed, about 15% disk space can be saved.

I am looking for the most efficient algorithm that can predict with high probability that a chunk of data (let's say 128 KB) can be compressed or not (lossless compression), without having to look at all the data if possible.

The compression algorithm will be either LZF, Deflate, or something similar (maybe Google Snappy). So predicting if data is compressible should be much faster than compressing the data itself, and use less memory.

Algorithms I already know about:

  • Try to compress a subset of the data, let's say 128 bytes (this is a bit slow; see the sketch after this list)

  • Calculate the sum of 128 bytes, and if it's within a certain range then it's likely not compressible (within 10% of 128 * 127) (this is fast, and relatively good, but I'm looking for something more reliable, because the algorithm really only looks at the topmost bits for each byte)

  • Look at the file headers (relatively reliable, but feels like cheating)
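
A minimal sketch of the first option above, assuming java.util.zip.Deflater as the compressor; the 128-byte sample size, the 90% threshold, and the method name are arbitrary assumptions, not from the question:

import java.util.zip.Deflater;

static boolean sampleLooksCompressible(byte[] data) {
    // compress only a small sample of the chunk
    int sampleLen = Math.min(data.length, 128);
    Deflater deflater = new Deflater(Deflater.BEST_SPEED);
    deflater.setInput(data, 0, sampleLen);
    deflater.finish();
    byte[] out = new byte[sampleLen * 2 + 32];
    int compressedLen = deflater.deflate(out);
    deflater.end();
    // if the sample shrank noticeably, the whole chunk is probably compressible
    return compressedLen < sampleLen * 9 / 10;
}

Keep in mind that Deflate adds a few bytes of header overhead, so a very small sample is biased towards "not compressible".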

I guess the general idea is that I need an algorithm that can quickly calculate if the probability of each bit in a list of bytes is roughly 0.5.
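
To make that idea concrete, here is a minimal sketch (not one of the algorithms listed above) that counts, per bit position, how often the bit is set in a sample; the 40-60% window and the method name are assumptions:

static boolean bitsLookRandom(byte[] data, int len) {
    // ones[b] = number of bytes in which bit b is set
    int[] ones = new int[8];
    for (int i = 0; i < len; i++) {
        int x = data[i] & 0xFF;
        for (int bit = 0; bit < 8; bit++) {
            ones[bit] += (x >> bit) & 1;
        }
    }
    for (int count : ones) {
        // a clearly biased bit position suggests compressible data
        if (count * 100 < len * 40 || count * 100 > len * 60) {
            return false;
        }
    }
    // every bit position is set in roughly half the bytes
    return true;
}

This catches byte-level bias such as plain ASCII text (where the top bit is always clear), but it cannot detect repeated byte sequences, so it is better used in combination with another check.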

Update

I have implemented 'ASCII checking', 'entropy calculation', and 'simplified compression', and all give good results. I want to refine the algorithms, and now my idea is to not only predict if data can be compressed, but also how much it can be compressed. Possibly using a combination of algorithms. Now if I could only accept multiple answers... I will accept the answer that gave the best results.

Additional answers (new ideas) are still welcome! If possible, with source code or links :-)

Update 2

A similar method is now implemented in Linux.

asked Aug 11 '11 by Thomas Mueller



2 Answers

Calculate the entropy of the data. If the entropy is high (close to 1.0), the data is unlikely to compress much further. If the entropy is low (close to 0.0), there isn't much "information" in it, and it can be compressed further.

Entropy gives a theoretical measure of how much a piece of data can be compressed.
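
As an illustration of this answer (not part of it), a minimal sketch of a byte-level Shannon entropy estimate on the same 0..1 scale; the 0.95 cutoff and the method name are assumptions for the example:

static boolean looksIncompressible(byte[] data, int len) {
    // histogram of byte values
    int[] counts = new int[256];
    for (int i = 0; i < len; i++) {
        counts[data[i] & 0xFF]++;
    }
    // Shannon entropy in bits per byte
    double entropy = 0;
    for (int c : counts) {
        if (c == 0) {
            continue;
        }
        double p = (double) c / len;
        entropy -= p * (Math.log(p) / Math.log(2));
    }
    // normalize by 8 bits; 0.95 is an arbitrary cutoff
    return entropy / 8 > 0.95;
}

Note that a byte-level histogram only sees the value distribution: data containing long repeated sequences of varied bytes can still score close to 1.0 even though it would compress well.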

answered by tskuzzy


I implemented a few methods to test if data is compressible.

Simplified Compression

This basically checks for duplicate byte pairs:

static boolean isCompressible(byte[] data, int len) {
    int result = 0;
    // check in blocks of 256 bytes,
    // and sum up how compressible each block is
    for (int start = 0; start < len; start += 256) {
        result += matches(data, start, Math.min(start + 255, len));
    }
    // the result is proportional to the number of
    // bytes that can be saved;
    // if we can save many bytes, then it is compressible
    return ((len - result) * 777) < len * 100;
}

static int matches(byte[] data, int i, int end) {
    // bitArray is a bloom filter of seen byte pairs
    // match counts duplicate byte pairs
    // last is the last seen byte
    int bitArray = 0, match = 0, last = 0;
    if (i < 0 || end > data.length) {
        // this check may allow the JVM to avoid
        // array bound checks in the following loop
        throw new ArrayIndexOutOfBoundsException();
    }
    for (; i < end; i++) {
        int x = data[i];
        // the bloom filter bit to set
        int bit = 1 << ((last ^ x) & 31);
        // if it was already set, increment match
        // (without using a branch, as branches are slow)
        match -= (-(bitArray & bit)) >> 31;
        bitArray |= bit;
        last = x;
    }
    return match;
}
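
For context, a hypothetical caller for the method above; the 128 KB chunk size mirrors the question, and the file handling is only an assumption for the sketch:

import java.io.FileInputStream;
import java.io.IOException;

public static void main(String[] args) throws IOException {
    // read one 128 KB chunk and decide whether to compress it
    byte[] chunk = new byte[128 * 1024];
    try (FileInputStream in = new FileInputStream(args[0])) {
        int len = in.read(chunk);
        if (len > 0 && isCompressible(chunk, len)) {
            System.out.println("compress this chunk");
        } else {
            System.out.println("store this chunk as-is");
        }
    }
}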

On my (limited) set of test data, this algorithm is quite accurate. It is about 5 times faster than compressing the data itself when the data is not compressible. For trivial data (all zeroes), however, it is only about half as fast.

Partial Entropy

This algorithm estimates the entropy of the high nibbles. I wanted to avoid using too many buckets, because they have to be zeroed out each time (which is slow if the blocks to check are small). 63 - numberOfLeadingZeros(v) serves as an integer approximation of the logarithm (I wanted to avoid floating-point arithmetic). Depending on the data, it is faster or slower than the algorithm above (I'm not sure why). The result isn't quite as accurate, possibly because it uses only 16 buckets and only integer arithmetic.

static boolean isCompressible(byte[] data, int len) {
    // the number of bytes with
    // high nibble 0, 1, ..., 15
    int[] sum = new int[16];
    for (int i = 0; i < len; i++) {
        int x = (data[i] & 255) >> 4;
        sum[x]++;
    }
    // see wikipedia to understand this formula :-)
    int r = 0;
    for (int x : sum) {
        long v = ((long) x << 32) / len;
        r += 63 - Long.numberOfLeadingZeros(v + 1);
    }
    return len * r < 438 * len;
}
answered by Thomas Mueller