Why does base64 encoding require padding if the input length is not divisible by 3?

Edit: An Illustration

Suppose we have a program that base64-encodes words, concatenates them and sends them over a network. It encodes "I", "AM" and "TJM", sandwiches the results together without padding and transmits them.

I encodes to SQ (SQ== with padding)
AM encodes to QU0 (QU0= with padding)
TJM encodes to VEpN (VEpN with padding)

So the transmitted data is SQQU0VEpN. The receiver base64-decodes this as I\x04\x14\xd1Q) instead of the intended IAMTJM. The result is nonsense because the sender has destroyed information about where each word ends in the encoded sequence. If the sender had sent SQ==QU0=VEpN instead, the receiver could have decoded this as three separate base64 sequences which would concatenate to give IAMTJM.
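The illustration above can be reproduced with a short sketch using Python's standard `base64` module. The variable names are chosen here for illustration only. Because the unpadded stream has 9 characters, which is not a valid base64 length, the sketch decodes only its first two complete 4-character groups to show the garbage.

```python
import base64

words = [b"I", b"AM", b"TJM"]

# Encode each word separately, as the sender in the illustration does.
encoded = [base64.b64encode(w).decode("ascii") for w in words]

# Sender strips the padding before concatenating: word boundaries are lost.
unpadded = "".join(e.rstrip("=") for e in encoded)

# Only the first 8 characters form complete 4-character groups.
garbled = base64.b64decode(unpadded[:8])

# With padding kept, each piece decodes on its own and the words survive.
recovered = b"".join(base64.b64decode(e) for e in encoded)

print(encoded)    # ['SQ==', 'QU0=', 'VEpN']
print(unpadded)   # SQQU0VEpN
print(garbled)    # b'I\x04\x14\xd1Q)'
print(recovered)  # b'IAMTJM'
```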

Why Bother with Padding?

Why not just design the protocol to prefix each word with an integer length? Then the receiver could decode the stream correctly and there would be no need for padding.

That's a great idea, as long as we know the length of the data we're encoding before we start encoding it. But what if, instead of words, we were encoding chunks of video from a live camera? We might not know the length of each chunk in advance.

If the protocol used padding, there would be no need to transmit a length at all. The data could be encoded as it came in from the camera, each chunk terminated with padding, and the receiver would be able to decode the stream correctly.
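A minimal sketch of that idea, assuming the chunks arrive as Python `bytes` objects of unpredictable length. The function names `encode_chunks` and `decode_stream` and the sample data are made up for this example. The receiver never needs a length prefix: it simply decodes the text four characters at a time.

```python
import base64
from typing import Iterable, Iterator

def encode_chunks(chunks: Iterable[bytes]) -> Iterator[str]:
    """Encode each chunk as it arrives; padding completes its last group."""
    for chunk in chunks:
        yield base64.b64encode(chunk).decode("ascii")

def decode_stream(stream: str) -> bytes:
    """Decode padded base64 text one 4-character group at a time."""
    return b"".join(
        base64.b64decode(stream[i:i + 4]) for i in range(0, len(stream), 4)
    )

# Made-up stand-ins for camera chunks of different lengths.
camera_chunks = [b"\x00\x01", b"frame", b"\xff"]

wire = "".join(encode_chunks(camera_chunks))
assert decode_stream(wire) == b"".join(camera_chunks)
```

Note that a chunk whose length happens to be a multiple of 3 produces no padding, so the padding does not mark every chunk boundary. What it does guarantee is that the decoded byte stream is correct.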

Obviously that's a very contrived example, but perhaps it illustrates why padding might conceivably be helpful in some situations.

On a related note, here's a base converter for arbitrary base conversion I created for you. Enjoy! https://convert.zamicol.com/

What are Padding Characters?

Padding characters help satisfy length requirements and carry no meaning.

Decimal Example of Padding: Given the arbitrary requirement that all strings be 8 characters long, the number 640 can meet it by using leading 0s as padding characters, since they carry no meaning: "00000640".

Binary Encoding

The Byte Paradigm: The byte is the de facto standard unit of measurement and any encoding scheme must relate back to bytes.

Base256 fits exactly into this paradigm. One byte is equal to one character in base256.

Base16, hexadecimal or hex, uses 4 bits for each character. One byte can represent two base16 characters.

Base64 does not fit evenly into the byte paradigm (nor does base32), unlike base256 and base16. All base64 characters can be represented in 6 bits, 2 bits short of a full byte.

We can represent base64 encoding versus the byte paradigm as a fraction: 6 bits per character over 8 bits per byte. Reduced, this fraction is 3/4: every 3 bytes (24 bits) map to exactly 4 characters.

This ratio, 3 bytes for every 4 base64 characters, is the rule we want to follow when encoding base64. Base64 encoding can only promise even measuring with 3-byte bundles, unlike base16 and base256, where every byte can stand on its own.
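The 3-to-4 ratio can be turned into length arithmetic. The sketch below is illustrative only (the helper names `padded_length` and `unpadded_length` are not from any library) and checks the formulas against Python's standard `base64` module.

```python
import base64

def padded_length(n_bytes: int) -> int:
    """Characters in padded base64: 4 per (possibly partial) 3-byte bundle."""
    return 4 * ((n_bytes + 2) // 3)

def unpadded_length(n_bytes: int) -> int:
    """Characters that actually carry data: one per 6 bits, rounded up."""
    return (n_bytes * 8 + 5) // 6

# Compare the formulas with the real encoder for inputs of 0 to 6 bytes.
for n in range(7):
    encoded = base64.b64encode(b"x" * n)
    assert len(encoded) == padded_length(n)
    assert len(encoded.rstrip(b"=")) == unpadded_length(n)
```

The difference between the two lengths is the number of "=" characters: 0, 2, or 1, depending on whether the input length modulo 3 is 0, 1, or 2.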

So why is padding encouraged even though encoding could work just fine without the padding characters?

If the length of a stream is unknown, or if it could be helpful to know exactly when a data stream ends, use padding. The padding characters communicate explicitly that those extra spots should be empty and rule out any ambiguity. Even if the length is unknown, with padding you'll know where your data stream ends.

As a counterexample, some standards like JOSE don't allow padding characters. In this case, if something is missing, a cryptographic signature won't verify or other non-base64 characters (like the ".") will be missing. Although no assumptions about length are made, padding isn't needed because if something is wrong it simply won't work.
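When the padding is omitted, as in JOSE's base64url encoding, a decoder can usually restore it from the length of the text alone. The following is a minimal sketch of that common approach using Python's `base64` module. The helper names `b64url_encode_nopad` and `b64url_decode_nopad` are hypothetical and not part of any JOSE library.

```python
import base64

def b64url_encode_nopad(data: bytes) -> str:
    """URL-safe base64 with the trailing '=' characters stripped."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")

def b64url_decode_nopad(text: str) -> bytes:
    """Restore the padding implied by the length, then decode normally."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))

print(b64url_encode_nopad(b"f"))    # Zg
print(b64url_decode_nopad("Zg"))    # b'f'
print(b64url_decode_nopad("Zm8"))   # b'fo'
print(b64url_decode_nopad("Zm9v"))  # b'foo'
```

This only works because the complete encoded string is delimited by something else (for example the "." separators mentioned above), so its length is known to the decoder.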

And this is exactly what the base64 RFC says,

In some circumstances, the use of padding ("=") in base-encoded data is not required or used. In the general case, when assumptions about the size of transported data cannot be made, padding is required to yield correct decoded data.

[...]

The padding step in base 64 [...] if improperly implemented, lead to non-significant alterations of the encoded data. For example, if the input is only one octet for a base 64 encoding, then all six bits of the first symbol are used, but only the first two bits of the next symbol are used. These pad bits MUST be set to zero by conforming encoders, which is described in the descriptions on padding below. If this property do not hold, there is no canonical representation of base-encoded data, and multiple base-encoded strings can be decoded to the same binary data. If this property (and others discussed in this document) holds, a canonical encoding is guaranteed.
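The pad-bit rule in the quoted passage can be checked by hand. The sketch below is an illustration, not a full validator: it inspects only the final data symbol, and `pad_bits` and `is_canonical` are names made up for this example.

```python
import string

# Standard base64 alphabet from RFC 4648: A-Z, a-z, 0-9, "+", "/".
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

def pad_bits(encoded: str) -> int:
    """Return the value of the unused low bits in the final data symbol."""
    data = encoded.rstrip("=")
    leftover = len(data) % 4
    if leftover == 0:
        return 0  # whole 4-character groups only: no pad bits at all
    if leftover == 1:
        raise ValueError("a lone trailing symbol cannot encode a full byte")
    # 2 symbols carry 12 bits for 8 bits of data; 3 carry 18 bits for 16.
    unused = 4 if leftover == 2 else 2
    return ALPHABET.index(data[-1]) & ((1 << unused) - 1)

def is_canonical(encoded: str) -> bool:
    """True when the pad bits are all zero, as the RFC requires of encoders."""
    return pad_bits(encoded) == 0

print(is_canonical("Zg=="))  # True
print(is_canonical("Zh=="))  # False
```

"Zg==" and "Zh==" differ only in those four unused bits, and the first eight bits of both spell the byte "f". A lenient decoder that ignores the pad bits would therefore map both strings to the same data, which is the non-canonical situation the RFC warns about.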

Padding allows us to decode base64 encoding with the promise of no lost bits. Without padding, there is no longer the explicit acknowledgement of measuring in 3-byte bundles. Without padding, you may not be able to guarantee exact reproduction of the original encoding without additional information, usually from somewhere else in your stack, such as TCP, checksums, or other methods.

Examples

Here is the example from RFC 4648 (https://www.rfc-editor.org/rfc/rfc4648#section-8):

Each character inside the "BASE64" function uses one byte (base256). We then translate that to base64.

BASE64("")       = ""           (No bytes used. 0%3=0.)
BASE64("f")      = "Zg=="       (One byte used. 1%3=1.)
BASE64("fo")     = "Zm8="       (Two bytes. 2%3=2.)
BASE64("foo")    = "Zm9v"       (Three bytes. 3%3=0.)
BASE64("foob")   = "Zm9vYg=="   (Four bytes. 4%3=1.)
BASE64("fooba")  = "Zm9vYmE="   (Five bytes. 5%3=2.)
BASE64("foobar") = "Zm9vYmFy"   (Six bytes. 6%3=0.)
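These vectors can be verified with Python's standard `base64` module. The `vectors` dictionary below simply restates the table above, and the loop is a quick sanity check rather than a test suite.

```python
import base64

# Input bytes mapped to the expected padded base64 output.
vectors = {
    b"": b"",
    b"f": b"Zg==",
    b"fo": b"Zm8=",
    b"foo": b"Zm9v",
    b"foob": b"Zm9vYg==",
    b"fooba": b"Zm9vYmE=",
    b"foobar": b"Zm9vYmFy",
}

for plain, expected in vectors.items():
    assert base64.b64encode(plain) == expected
    assert base64.b64decode(expected) == plain
    # Padding count follows the input length modulo 3: 0 -> 0, 1 -> 2, 2 -> 1.
    assert expected.count(b"=") == (3 - len(plain) % 3) % 3
```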

Here's an encoder that you can play around with: http://www.motobit.com/util/base64-decoder-encoder.asp

There is not much benefit to it in the modern day. So let's look at this as a question of what the original historical purpose may have been.

Base64 encoding makes its first appearance in RFC 1421, dated 1993. This RFC is actually focused on encrypting email, and base64 is described in one small section, 4.3.2.4.

This RFC does not explain the purpose of the padding. The closest we have to a mention of the original purpose is this sentence:

A full encoding quantum is always completed at the end of a message.

It does not suggest concatenation (top answer here), nor ease of implementation, as an explicit purpose for the padding. However, considering the entire description, it is not unreasonable to assume that the padding was intended to help the decoder read the input in 32-bit units ("quanta"). That is of no benefit today; in 1993, however, unsafe C code would very likely have taken advantage of this property.
