Is there any way to determine if a string is base64-encoded twice?
For example, is there a regex pattern that I can use with preg_match to do this?
In base64 encoding, the character set is A-Z, a-z, 0-9, + and /. If the final group has fewer than 4 characters, it is padded with '=' characters. ^([A-Za-z0-9+/]{4})* means the string starts with zero or more complete 4-character base64 groups.
When decoding Base64 text, four characters are typically converted back to three bytes. The only exceptions are when padding characters exist. A single = indicates that the four characters will decode to only two bytes, while == indicates that the four characters will decode to only a single byte.
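As a rough sketch of that rule in PHP (the helper name base64_decoded_length is made up for illustration, and it assumes well-formed, padded input):

```php
<?php
// Hypothetical helper: how many bytes a padded base64 string decodes to.
// Assumes the input is well formed (length a multiple of 4, at most two '=').
function base64_decoded_length(string $encoded): int
{
    $groups  = intdiv(strlen($encoded), 4);                      // 4-character groups
    $padding = strlen($encoded) - strlen(rtrim($encoded, '='));  // trailing '=' count
    return $groups * 3 - $padding;                               // 3 bytes per group, minus padding
}

echo base64_decoded_length('QUFB'), "\n"; // 3 ("AAA")
echo base64_decoded_length('QUE='), "\n"; // 2 ("AA")
echo base64_decoded_length('QQ=='), "\n"; // 1 ("A")
```

The results agree with strlen(base64_decode(...)) for these inputs.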
(Practical answer.) Don't use regex. Decode your string using base64_decode() with its optional $strict parameter set to true and see if it matches the format you expect. Or simply try to decode it as many times as it permits. E.g.:
function base64_decode_multiple(string $data, int $count = 2): string {
    // Peel off up to $count layers; stop early once strict decoding fails.
    while ($count-- > 0 && ($decoded = base64_decode($data, true)) !== false) {
        $data = $decoded;
    }
    return $data;
}
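If you want a yes/no answer rather than the decoded data, something along these lines might do. It is only a sketch: is_base64_encoded_times is a hypothetical name, and the re-encode comparison is my own addition, there because strict mode may still accept some non-canonical input (such as missing padding).

```php
<?php
// Hypothetical helper: does $data strictly decode $depth times in a row?
function is_base64_encoded_times(string $data, int $depth = 2): bool
{
    for ($i = 0; $i < $depth; $i++) {
        if ($data === '') {
            return false; // nothing left to decode
        }
        $decoded = base64_decode($data, true);
        // Require a clean round trip so only canonical encodings count.
        if ($decoded === false || base64_encode($decoded) !== $data) {
            return false;
        }
        $data = $decoded;
    }
    return true;
}

var_dump(is_base64_encoded_times('QUFBQQ=='));    // bool(true): decodes to "AAAA", then to three NUL bytes
var_dump(is_base64_encoded_times('QUFBQQ==', 3)); // bool(false): NUL bytes are not base64
```

Keep in mind this only tells you the string can be decoded twice, not that someone actually encoded it twice on purpose.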
(Theoretical answer.) Double-base-64-encoded strings are regular, because they can be validated in fixed-size groups and only a finite number of such groups properly base64-encode part of a base64-encoded message.
You can check if something is base64-encoded once since you can validate each set of four characters. The last four characters in a base64-encoded message may be a special case because =s are used as padding. Using the regular expression:
<char> := [A-Za-z0-9+/]
<chunk> := <char>{4}
<end-chunk> := <char>{2} == | <char>{3} =
<base64-encoded> := <chunk>* <end-chunk>?
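For what it's worth, here is how that grammar might look as an actual preg_match pattern, written so that = can only appear as trailing padding. The names BASE64_PATTERN and looks_base64_encoded are made up for this sketch. It checks the shape only, not whether the unused bits in a padded group are zero.

```php
<?php
// Hypothetical constant: <chunk>* <end-chunk>? as a PCRE pattern.
// '~' is the delimiter because '/' is part of the base64 alphabet;
// the D modifier stops '$' from matching before a trailing newline.
const BASE64_PATTERN = '~^(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?$~D';

function looks_base64_encoded(string $s): bool
{
    return preg_match(BASE64_PATTERN, $s) === 1;
}

var_dump(looks_base64_encoded('QUFBQQ==')); // bool(true)
var_dump(looks_base64_encoded('AA=A'));     // bool(false): '=' in the middle of a group
var_dump(looks_base64_encoded('QUF'));      // bool(false): incomplete group
```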
You can also determine if something is base64-encoded twice using regular expressions, but the solution is neither trivial nor pretty, since it's not enough to check 4 characters at a time.
Example: "QUFBQQ==" base64-decodes to "AAAA", which base64-decodes to three NUL bytes:
$ echo -n "QUFBQQ==" | base64 -d | xxd
00000000: 4141 4141 AAAA
$ echo -n "AAAA" | base64 -d | xxd
00000000: 0000 00 ...
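The same check in PHP, as a minimal sketch (the variable names are just for illustration):

```php
<?php
$once  = base64_decode('QUFBQQ==', true); // first layer
$twice = base64_decode($once, true);      // second layer

echo $once, "\n";           // AAAA
echo bin2hex($twice), "\n"; // 000000
```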
At this point we could enumerate all double-base64-encodings where the base64-encoding is 4 characters within the base64 alphabet ("AAAA", "AAAB", "AAAC", "AAAD", etc.) and minimize this:
<ugly 4> := QUFBQQ== | QUFBQg== | QUFBQw== | QUFBRA== | ...
And we could enumerate the first 4 characters of all double-base64-encodings where the base64-encoding is 8 characters or longer (cases that don't involve padding with =) and minimize that:
<chunk 4> := QUFB | QkFB | Q0FB | REFB | ...
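To get a feel for how large these enumerations are, here is a sketch that generates the alternatives of <chunk 4> by encoding every 3-character string over the base64 alphabet. The generator name chunk4_alternatives is hypothetical, and the loop order is chosen only so the output matches the listing above.

```php
<?php
// Hypothetical generator: each alternative of <chunk 4>, i.e. the base64
// encoding of every 3-character string over the base64 alphabet.
function chunk4_alternatives(): Generator
{
    $alphabet = str_split('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/');
    foreach ($alphabet as $third) {
        foreach ($alphabet as $second) {
            foreach ($alphabet as $first) { // first character varies fastest
                yield base64_encode($first . $second . $third);
            }
        }
    }
}

foreach (chunk4_alternatives() as $i => $alternative) {
    if ($i >= 4) {
        break;
    }
    echo $alternative, "\n"; // QUFB, QkFB, Q0FB, REFB
}

echo iterator_count(chunk4_alternatives()), "\n"; // 262144 (64^3)
```

By the same counting, <ugly 4> would have 64^4 = 16,777,216 alternatives before minimization.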
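One partition (the pretty one) of double-base64-encoded strings will not contain =s at the end. That only happens when the length of the inner base64-encoding is divisible by both 4 and 3, so their lengths are a multiple of 16:
<pretty double-base64-encoded> := (<chunk 4>{4})*
Another partition of double-base64-encoded strings ends in = padding; their lengths are 8 or 12 more than a multiple of 16 (8, 12, 24, 28, etc.). They can be thought of as pretty ones with an ugly bit at the end, where <ugly 8> is the analogous enumeration for an inner base64-encoding of 8 characters (12 characters ending in a single =, e.g. QUFBQUFBQUE= for "AAAAAAAA"):
<ugly double-base64-encoded> := (<chunk 4>{4})* (<ugly 4> | <ugly 8>)
We could then construct a combined regular expression:
<double-base64-encoded> := <pretty double-base64-encoded>
| <ugly double-base64-encoded>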
As I said, you probably don't want to go through all this mess just because double-base64-encoded messages are regular. Just like you don't want to use a regex to check if an integer is within some finite interval. Also, this is a good example of getting the wrong answer when you should have been asking another question. :-)