Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why does base64-encoded data compress so poorly?

I was recently compressing some files, and I noticed that base64-encoded data seems to compress really bad. Here is one example:

  • Original file: 429,7 MiB
  • compress via xz -9:
    13,2 MiB / 429,7 MiB = 0,031 4,9 MiB/s 1:28
  • base64 it and compress via xz -9:
    26,7 MiB / 580,4 MiB = 0,046 2,6 MiB/s 3:47
  • base64 the original compressed xz file:
    17,8 MiB in almost no time = the expected 1.33x increase in size

So what can be observed is that:

  • xz compresses really good ☺
  • base64-encoded data doesn't compress well, it is 2 times larger than the unencoded compressed file
  • base64-then-compress is significantly worse and slower than compress-then-base64

How can this be? Base64 is a lossless, reversible algorithm, why does it affect compression so much? (I tried with gzip as well, with similar results).

I know it doesn't make sense to base64-then-compress a file, but most of the time one doesn't have control over the input files, and I would have thought that since the actual information density (or whatever it is called) of a base64-encoded file would be nearly identical to the non-encoded version, and thus be similarily compressible.

like image 646
Stefan Seidel Avatar asked Jun 30 '16 13:06

Stefan Seidel


People also ask

Does Base64 compress well?

base64-encoded data doesn't compress well, it is 2 times larger than the unencoded compressed file.

Is Base64 inefficient?

Base64 is apparently wasteful because we use just 64 different values per byte, whereas a byte can represent 256 different characters. That is, we use bytes (which are 8-bit words) as 6-bit words. There is a waste of 2 bits for each 8 bits of transmission data.

Why is Base64 the most compact encoding?

Whereas Decimal is even less efficient: each byte of data would be represented as three numbers, which means that each byte of unencoded data would take up three bytes as encoded data. Base64 maps every three bytes of data into four bytes of encoded data.

Why is Base64 more compact than binary?

Base64 is more space efficient. Using 4 characters to represent 3 bytes where as hex uses 2 characters for each byte. In other words: hex increases the size of the string with 100%.

Why can't Base64 be used to compress data?

Base64 is already encoded in a way which does not suit most compression algorithms - see Why does base64-encoded data compress so poorly? for details. You will want to compress the original binary data and then base64 the compressed data, or don't bother coverting to base64 atall.

What is Base64 encoding and how does it work?

The trick behind base64 encoding is that we use 64 different ASCII characters including all letters, upper and lower case, and all numbers. Not all non-textual documents are shared online using base64 encoding.

Why would you use Base64 and thengzip?

Why would you use Base64 and thengzip? The point of using Base64 is so that the output only uses a safe subset of ASCII. But gzip will turn that into binary. If you’re going to compress a file, there is no value in using Base64 first.

Why is Base64 a waste of memory?

Base64 is apparently wasteful because we use just 64 different values per byte, whereas a byte can represent 256 different characters. That is, we use bytes (which are 8-bit words) as 6-bit words. There is a waste of 2 bits for each 8 bits of transmission data.


1 Answers

Most generic compression algorithms work with a one-byte granularity.

Let's consider the following string:

"XXXXYYYYXXXXYYYY" 
  • A Run-Length-Encoding algorithm will say: "that's 4 'X', followed by 4 'Y', followed by 4 'X', followed by 4 'Y'"
  • A Lempel-Ziv algorithm will say: "That's the string 'XXXXYYYY', followed by the same string: so let's replace the 2nd string with a reference to the 1st one."
  • A Huffman coding algorithm will say: "There are only 2 symbols in that string, so I can use just one bit per symbol."

Now let's encode our string in Base64. Here's what we get:

"WFhYWFlZWVlYWFhYWVlZWQ==" 

All algorithms are now saying: "What kind of mess is that?". And they're not likely to compress that string very well.

As a reminder, Base64 basically works by re-encoding groups of 3 bytes in (0...255) into groups of 4 bytes in (0...63):

Input bytes    : aaaaaaaa bbbbbbbb cccccccc 6-bit repacking: 00aaaaaa 00aabbbb 00bbbbcc 00cccccc 

Each output byte is then transformed into a printable ASCII character. By convention, these characters are (here with a mark every 10 characters):

0         1         2         3         4         5         6 ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/ 

For instance, our example string begins with a group of three bytes equal to 0x58 in hexadecimal (ASCII code of character "X"). Or in binary: 01011000. Let's apply Base64 encoding:

Input bytes      : 0x58     0x58     0x58 As binary        : 01011000 01011000 01011000 6-bit repacking  : 00010110 00000101 00100001 00011000 As decimal       : 22       5        33       24 Base64 characters: 'W'      'F'      'h'      'Y' Output bytes     : 0x57     0x46     0x68     0x59 

Basically, the pattern "3 times the byte 0x58" which was obvious in the original data stream is not obvious anymore in the encoded data stream because we've broken the bytes into 6-bit packets and mapped them to new bytes that now appear to be random.

Or in other words: we have broken the original byte alignment that most compression algorithms rely on.

Whatever compression method is used, it usually has a severe impact on the algorithm performance. That's why you should always compress first and encode second.

This is even more true for encryption: compress first, encrypt second.

EDIT - A note about LZMA

As MSalters noticed, LZMA -- which xz is using -- is working on bit streams rather than byte streams.

Still, this algorithm will also suffer from Base64 encoding in a way which is essentially consistent with my earlier description:

Input bytes      : 0x58     0x58     0x58 As binary        : 01011000 01011000 01011000 (see above for the details of Base64 encoding) Output bytes     : 0x57     0x46     0x68     0x59 As binary        : 01010111 01000110 01101000 01011001 

Even by working at the bit level, it's much easier to recognize a pattern in the input binary sequence than in the output binary sequence.

like image 173
Arnauld Avatar answered Sep 17 '22 14:09

Arnauld