
Why does zlib compress string data better than binary data?

Say I have a .txt file like this:

11111111111111Hello and welcome to stackoverflow. stackoverflow will hopefully provide me with answers to answers i do not know. Hello and goodbye.11111111111111

Then I have an equivalent in binary form (a .bin file), created like this:

Stream.Write(intBytes, 0, intBytes.Length); // 11111111111111
Stream.Write(junkText, 0, junkText.Length); // Hello and welcome to stackoverflow...
Stream.Write(intBytes, 0, intBytes.Length); // 11111111111111

The first example compresses better than the second. If I remove the 11111111111111 markers, both compress to the same size, but with the markers in place the .txt version compresses better.

byte[] intBytes = BitConverter.GetBytes(11111111111111); // This is 8 bytes
byte[] strBytes = UTF8Encoding.UTF8.GetBytes("11111111111111"); // This is 14 bytes
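
To make the difference between the two encodings concrete, here is a minimal Python sketch (an illustration only, not part of the original C# code). It assumes a little-endian 64-bit layout, which is what `BitConverter.GetBytes` yields for a `long` on typical x86/x64 machines; the variable names are chosen to mirror the snippet above.

```python
import struct

NUMBER = 11111111111111

# Assumption: little-endian signed 64-bit layout, as BitConverter.GetBytes
# would produce for a long on a little-endian machine.
int_bytes = struct.pack("<q", NUMBER)

# The text form: one byte per decimal digit.
str_bytes = str(NUMBER).encode("utf-8")

print("binary:", len(int_bytes), "bytes ->", int_bytes.hex(" "))
print("text:  ", len(str_bytes), "bytes ->", str_bytes)

# Count how many different byte values each form contains.
print("distinct byte values (binary):", len(set(int_bytes)))
print("distinct byte values (text):  ", len(set(str_bytes)))
```

The text form is a single byte value repeated 14 times, while the binary form packs several different byte values into its 8 bytes.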

This is using the native C++ Zlib library.

Before compression the .bin file is smaller, as I expected.

Why is the .txt version smaller after compression? It seems to compress better than its .bin equivalent.

bin file: Uncompressed Size: 2448 bytes, Compressed Size: 177 bytes

txt file: Uncompressed Size: 2460 bytes, Compressed Size: 167 bytes

Science_Fiction asked Jun 28 '26 22:06



1 Answer

So the bigger file compresses to the smaller result. I can offer two explanations:

  1. Compression works when the input has low entropy. Try compressing 180 bytes of random data: the result will be larger than that of either of your test cases. Prepending the binary number means the compressor has to deal with binary data and text at the same time. It introduces byte values that do not occur anywhere in the text, which raises the entropy of the file. The text version, by contrast, adds only a run of one repeated character, which is cheap to encode.
  2. All compression algorithms have weak and strong spots (except for perfect "Kolmogorov" compression). You might be seeing an anomaly caused by some implementation detail. The difference is not big, after all.
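
The first point can be checked with a small experiment. The following is a rough Python sketch using the standard `zlib` module, not the asker's actual setup: `JUNK` is a hypothetical stand-in for the question's text (the real payload is not shown), the little-endian 64-bit layout is an assumption, and the helper names are made up for illustration. The exact sizes will therefore differ from the figures in the question.

```python
import random
import struct
import zlib

NUMBER = 11111111111111

# Hypothetical stand-in for the question's junk text; the real payload is not
# shown, so the sizes printed here will not match the question's figures.
JUNK = (
    b"Hello and welcome to stackoverflow. stackoverflow will hopefully "
    b"provide me with answers to answers i do not know. Hello and goodbye."
)


def wrap(marker: bytes) -> bytes:
    """Surround the junk text with the marker, as in the question."""
    return marker + JUNK + marker


def compressed_size(data: bytes) -> int:
    """Length of the zlib stream at the highest compression level."""
    return len(zlib.compress(data, 9))


txt_data = wrap(str(NUMBER).encode("utf-8"))  # 14-byte ASCII marker
bin_data = wrap(struct.pack("<q", NUMBER))    # 8-byte marker (assumed little-endian)

# 180 random bytes, as suggested above; fixed seed so the run is repeatable.
rng = random.Random(0)
random_data = bytes(rng.randrange(256) for _ in range(180))

for label, data in [("txt", txt_data), ("bin", bin_data), ("random", random_data)]:
    print(f"{label}: uncompressed={len(data)} compressed={compressed_size(data)}")
```

The random input should come out larger than it went in, since there is no redundancy to exploit and the zlib framing adds a few bytes. How the txt and bin cases compare depends on the surrounding text, so it is worth running this against the real payload.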
usr answered Jul 01 '26 21:07
