
Why is base128 not used? [closed]

The problem is that at least 32 characters of the ASCII character set are 'control characters' which may be interpreted by the receiving terminal. E.g., there's the BEL (bell) character that makes the receiving terminal chime. There are the SOH (Start of Heading) and EOT (End of Transmission) characters, which do what their names imply. And don't forget the characters CR and LF, which may have special meanings in how data structures are serialized/flattened into a stream.

Adobe created the Base85 encoding to use more characters in the ASCII character set, but AFAIK it's protected by patents.


Because some of those 128 characters are unprintable (mainly those below code point 0x20). Therefore, they can't reliably be transmitted as a string over the wire. And if you go above code point 127, you can run into encoding issues because different systems use different encodings.
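To make the count concrete, here is a small Python sketch that splits the 128 ASCII code points into control and printable ones. The names `control` and `printable` are just illustrative; the ranges follow the usual ASCII layout (0x00 to 0x1F plus DEL are control characters, 0x20 to 0x7E are printable, counting the space).

```python
# Classify the 128 ASCII code points by the standard ASCII ranges.
control = [c for c in range(128) if c < 0x20 or c == 0x7F]
printable = [c for c in range(128) if 0x20 <= c <= 0x7E]

print(len(control))    # control characters, including DEL
print(len(printable))  # printable characters, including the space

# Which alphabet sizes fit entirely into printable ASCII?
for size in (64, 85, 91, 128):
    print(size, size <= len(printable))
```

With fewer than 128 printable characters available, a 128-symbol alphabet cannot stay inside printable ASCII, while 64, 85 and 91 symbols can.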


As already stated in the other answers, the key point is to reduce the character set to the printable ones. A more efficient encoding scheme is basE91, because it uses a larger character set and still avoids control/whitespace characters in the low ASCII range. The basE91 webpage contains a nice comparison of binary vs. base64 vs. basE91 encoding efficiency.

I once cleaned up the Java implementation. If people are interested I could push it on GitHub.

Update: It's now on GitHub.
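For a rough feel of the efficiency argument, the sketch below measures the output size of the two encodings that ship with Python's `base64` module (base64 and Ascii85) and computes a theoretical lower bound for other alphabet sizes. basE91 is not in the standard library, so it only appears in the theoretical part; the helper `min_expansion` is a name made up for this example, and real encoders may do somewhat worse than the bound.

```python
import base64
import math

# 1200 bytes without any zero bytes, so Ascii85's "z" shortcut for
# all-zero groups does not distort the measurement.
data = bytes(range(1, 241)) * 5

b64 = base64.b64encode(data)
a85 = base64.a85encode(data)
print("base64 :", len(b64) / len(data))
print("Ascii85:", len(a85) / len(data))

def min_expansion(alphabet_size: int) -> float:
    """Lower bound on output size / input size for a given alphabet size."""
    return 8 / math.log2(alphabet_size)

for size in (64, 85, 91, 128):
    print(size, round(min_expansion(size), 3))
```

The bound shrinks slowly as the alphabet grows, which is why larger alphabets only buy a modest gain over base64.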


The fact that the first 32 characters are control characters has absolutely no relevance, because you don't have to use them to get 128 characters. We have 256 characters to choose from, and only the first 32 are control characters. That leaves 224 characters, so 128 is entirely possible without using control characters.

Here is the reason: It has to be something that will look the same, and that you can copy and paste, no matter where. Therefore it has to consist of characters that will be displayed the same on any forum, chat, email and so on. That means that we can't use characters that forum, chat or email clients typically use for formatting or disregard. It also has to be characters that are the same regardless of font, language and regional settings.

That is the reason!
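As a small illustration of the "regional settings" point, the following Python sketch decodes the same two bytes under three legacy single-byte encodings. The byte values and the choice of code pages are arbitrary examples, not taken from the answers above.

```python
# One ASCII byte (0x41) followed by one byte above 0x7F (0xE9).
raw = bytes([0x41, 0xE9])

# Decode the same bytes under three single-byte encodings.
decoded = {name: raw.decode(name) for name in ("latin-1", "cp437", "koi8-r")}
for name, text in decoded.items():
    print(name, repr(text))

# As UTF-8, a lone 0xE9 byte is not even valid input.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print("valid UTF-8:", utf8_ok)
```

The ASCII byte comes out the same everywhere, while the high byte turns into a different character in each code page. That is the kind of ambiguity a copy-and-paste-safe alphabet has to avoid.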


Base64 is common because it solves a variety of issues (it works nearly everywhere you can think of):

  • You don't need to worry whether the transport is 8-bit clean or not.

  • All the characters in the encoding are printable. You can see them. You can copy and paste them. You can use them in URLs (with particular variants), and so on.

  • Fixed encoding size. You know that m bytes can always encode to n bytes.

  • Everyone has heard of it - it's widely supported, lots of libraries, so easy to interoperate with.

Base128 doesn't have all those advantages.

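At first glance base128 looks safe for a transport that is not 8-bit clean, since 128 symbols fit in 7 bits. But recall that base64 actually uses 65 symbols: 64 for data plus the `=` padding character. Without such an out-of-band character you can't have the benefits of a fixed encoding size. If you do add one, it becomes a 129th symbol that no longer fits in 7 bits, so the encoding then needs an 8-bit clean transport.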
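The fixed-size property that base64 gets from its padding symbol can be checked with Python's standard `base64` module. This is only a sketch; `base64_length` is a helper name invented for the example.

```python
import base64

def base64_length(m: int) -> int:
    """Length of padded base64 output for m input bytes: 4 * ceil(m / 3)."""
    return 4 * ((m + 2) // 3)

# Compare the formula with the real encoder for a range of input sizes.
for m in range(100):
    encoded = base64.b64encode(bytes(m))
    assert len(encoded) == base64_length(m)
    assert len(encoded) % 4 == 0  # padding keeps output a multiple of 4

print(base64.b64encode(b"a"))    # two '=' padding characters
print(base64.b64encode(b"ab"))   # one '=' padding character
print(base64.b64encode(b"abc"))  # no padding needed
```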

It's not all negative though.

  • base128 is easier to encode/decode than base64 - you just use shifts and masks. This can be important for embedded implementations.

  • base128 makes slightly more efficient use of the transport than base64 by using more of the available bits.

People do use base128 - I'm using it for something now. It's just not as common.
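The shifts-and-masks point can be sketched as follows. This is a minimal illustration, not any particular library's format: the function names are made up, the output symbols are the raw values 0 to 127 (so they include control characters and are not safe for text transports), and there is no padding symbol, so the decoder simply drops the leftover bits at the end.

```python
import base64

def base128_encode(data: bytes) -> bytes:
    """Pack 8-bit bytes into 7-bit symbols (values 0..127)."""
    out = bytearray()
    acc = 0    # bit accumulator
    nbits = 0  # number of valid bits currently in acc
    for byte in data:
        acc = (acc << 8) | byte
        nbits += 8
        while nbits >= 7:
            nbits -= 7
            out.append((acc >> nbits) & 0x7F)
        acc &= (1 << nbits) - 1  # keep only the bits not yet emitted
    if nbits:
        # Left-align the remaining bits in a final symbol, zero-filled.
        out.append((acc << (7 - nbits)) & 0x7F)
    return bytes(out)

def base128_decode(symbols: bytes) -> bytes:
    """Inverse of base128_encode; trailing fill bits are discarded."""
    out = bytearray()
    acc = 0
    nbits = 0
    for sym in symbols:
        if sym > 0x7F:
            raise ValueError("symbol out of range for base128")
        acc = (acc << 7) | sym
        nbits += 7
        if nbits >= 8:
            nbits -= 8
            out.append((acc >> nbits) & 0xFF)
            acc &= (1 << nbits) - 1
    return bytes(out)

payload = bytes(range(21))
encoded = base128_encode(payload)
print(len(encoded), len(base64.b64encode(payload)))  # symbols needed by each
assert base128_decode(encoded) == payload
```

Every 7 input bytes become 8 symbols, compared with 3 bytes becoming 4 symbols in base64, which is the "slightly more efficient" part.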