
Is it possible to get collisions with Base64 encoding / decoding?

A similar question was asked here: Is base64 encoding always one to one

And apparently the answer (to the similar question) is YES. I already know that, BUT I'd be curious to know the explanation for why these two strings appear to be equivalent after being Base64 decoded:

cwB0AGQAAG==

cwB0AGQAAA==


One more thing... when you take the decoded string and re-encode it, both re-encode to the same value: cwB0AGQAAA==

What happened?

asked Nov 09 '18 12:11 by SlowLearner



1 Answer

base64 encoding is one-to-one, but decoding is not: many decoders accept several different strings for the same bytes. What you're seeing is two ways to write the unused bits in the padded group at the end of the string.

base64 encodes bytes (8 bits each) into base 64. A character in base64 encodes 6 bits, so four base64 characters can handle three bytes. When the length of the input is not a multiple of three bytes, base64 uses = as a padding character to fill up the last group of four base64 characters. XXX= indicates that only the first two bytes of the group are to be used (where XXX represents three arbitrary base64 characters), while XX== indicates that only the first byte should be used.
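To see the padding rule in action, here is a minimal sketch using Python's standard base64 module. The sample inputs are arbitrary and only there for illustration.

```python
import base64

# Encode inputs of 1, 2 and 3 bytes and look at the final group of four.
one_byte = base64.b64encode(b"a").decode()     # "YQ=="  (XX==: one byte used)
two_bytes = base64.b64encode(b"ab").decode()   # "YWI="  (XXX=: two bytes used)
three_bytes = base64.b64encode(b"abc").decode()  # "YWJj" (no padding needed)

for text in (one_byte, two_bytes, three_bytes):
    print(text)
```

The encoded length is always a multiple of four characters; only the number of trailing = signs changes.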

The last group in your example is AA==, which encodes a 0 byte. However, the AA part can encode 12 bits, of which the least significant four are ignored on decoding, so you can use any second character from A-P and get the same result. The G in your first string (AG==) is one of those characters, which is why both strings decode to the same bytes. When you use the encoder it always picks zeros for those four bits, so you get back AA==.
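The two strings from the question can be checked directly. This is a sketch that assumes a lenient decoder such as the default mode of Python's base64.b64decode; a strict decoder is allowed to reject non-zero trailing bits, so other libraries may behave differently.

```python
import base64
import string

# The two strings from the question differ only in the ignored bits.
first = base64.b64decode("cwB0AGQAAG==")
second = base64.b64decode("cwB0AGQAAA==")
same_bytes = first == second

# Re-encoding always zeroes the unused bits, giving the canonical form.
canonical = base64.b64encode(first).decode()

# Second characters A..P have values 0..15, so their top two bits are 00
# and only the ignored low four bits vary.
a_to_p = string.ascii_uppercase[:16]
decoded = {c: base64.b64decode("A" + c + "==") for c in a_to_p}

print(same_bytes, canonical)
print(set(decoded.values()))
```

Going one character further (AQ==) changes a bit that is actually used, so it decodes to a different byte.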

Padding is actually even more complicated in base64. Technically you can exclude the = characters, since the length of the string already indicates how many bytes the last group holds (according to Wikipedia, not all decoders support this). Where padding is useful is that it allows base64 strings to be safely concatenated, since every group of four is interpreted the same way. However, this means that padding can also appear in the middle of a string, so a decoder that accepts it will take all sorts of encodings for the same sequence of bytes. Many decoders also ignore whitespace and newlines, which adds still more variants.
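How a decoder treats missing padding and whitespace varies by library. The sketch below shows Python's default behaviour only; decode_unpadded is a hypothetical helper written for this example, not part of the standard library.

```python
import base64
import binascii

def decode_unpadded(text: str) -> bytes:
    """Hypothetical helper: restore stripped '=' padding, then decode."""
    return base64.b64decode(text + "=" * (-len(text) % 4))

reference = base64.b64decode("cwB0AGQAAA==")

# Padding stripped: the length alone tells us how many '=' to put back.
from_unpadded = decode_unpadded("cwB0AGQAAA")

# In its default (non-validating) mode, b64decode discards characters
# outside the alphabet, so whitespace and newlines are skipped.
from_spaced = base64.b64decode("cwB0\nAGQA AA==")

# Without the helper, Python's decoder rejects the unpadded string.
try:
    base64.b64decode("cwB0AGQAAA")
    rejected = False
except binascii.Error:
    rejected = True

print(from_unpadded == reference, from_spaced == reference, rejected)
```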

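Despite all of this, base64 encoding is still injective, meaning if x != y, then base64(x) != base64(y); as a result, you cannot get collisions and can always get the original data back. What is not injective is decoding: several different strings can decode to the same data. Put another way, the encoder is not surjective onto the set of strings a decoder accepts, because it only ever produces one of the many accepted spellings.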
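The no-collision claim can be spot-checked by brute force on short inputs. This is only an illustration over inputs of 0 to 2 bytes, not a proof for every length.

```python
import base64
from itertools import product

# Every byte string of length 0, 1 or 2 (1 + 256 + 65536 inputs).
inputs = [bytes(t) for n in (0, 1, 2) for t in product(range(256), repeat=n)]

# Injective: distinct inputs must produce distinct encodings.
encodings = {base64.b64encode(x) for x in inputs}
all_distinct = len(encodings) == len(inputs)

# Decoding the encoder's own output always returns the original bytes.
round_trips = all(base64.b64decode(base64.b64encode(x)) == x for x in inputs)

print(all_distinct, round_trips)
```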

answered Oct 16 '22 00:10 by Tej Chajed
