
Converting to Base64 in JavaScript without Deprecated 'Escape' call

My name is Festus.

I need to convert strings to and from Base64 in a browser via JavaScript. The topic is covered quite well on this site and on Mozilla, and the suggested solution seems to be along these lines:

function toBase64(str) {
    return window.btoa(unescape(encodeURIComponent(str)));
}

function fromBase64(str) {
    return decodeURIComponent(escape(window.atob(str)));
}

I did a bit more research and found out that escape() and unescape() are deprecated and should no longer be used. With that in mind, I tried removing calls to the deprecated functions which yields:

function toBase64(str) {
    return window.btoa(encodeURIComponent(str));
}

function fromBase64(str) {
    return decodeURIComponent(window.atob(str));
}

This seems to work, but it raises the following questions:

(1) Why did the originally proposed solution include calls to escape() and unescape()? The solution was proposed prior to deprecation but presumably these functions added some kind of value at the time.

(2) Are there certain edge cases where my removal of these deprecated calls will cause my wrapper functions to fail?

NOTE: There are other, far more verbose and complex solutions on StackOverflow to the problem of string=>Base64 conversion. I'm sure they work just fine but my question is specifically related to this particular popular solution.

Thanks,

Festus

Festus Martingale, asked Jun 03 '15 22:06


1 Answer

TL;DR / Short Summary

Don't use btoa(encodeURIComponent(str)) and decodeURIComponent(atob(str)) - that's “nonsense”.

“Convert string to Base64” usually means “encode string as UTF-8 and encode the bytes as Base64”, and that's exactly what btoa(unescape(encodeURIComponent(str))) does. btoa(encodeURIComponent(str)) does something else that isn't useful in any case I can imagine, even though it never throws an error, as explained in humanityANDpeace's detailed answer.



What does “convert string to Base64” mean?

Base64 is a binary-to-text encoding: a sequence of bytes is encoded as a sequence of ASCII characters.[1] It is therefore not possible to directly encode text as Base64. Conceptually, it is always a two-step procedure:

  1. convert string to bytes (using some character encoding)
  2. encode bytes as Base64

In principle, you can use any character encoding (also called charset[2] or Encoding Scheme) you want. It just needs to be able to represent all the characters you need, and it has to be the same in both directions (text to Base64 and Base64 to text). As there are many different character encodings, the protocol or API should define which one is used. If an API expects a "string encoded via Base64" and doesn't mention the character encoding, you can nowadays usually assume that UTF-8 is expected.[3]

Base64-encoding the bytes from step 1 is pretty straightforward:
a) Take three input bytes to get 24 bits.
b) Split into four chunks of 6 bits each, to get four numbers in range 0...63.
c) Translate numbers to ASCII chars via table and add these chars to the output
d) Goto a)
More details about Base64 itself are out of the scope of this answer.
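
Steps a) to d) can be sketched roughly like this. This is only an illustration, not how btoa is actually implemented; the names ALPHABET and bytesToBase64 are made up for this example, and the "=" padding for inputs whose length is not a multiple of three is a detail of the Base64 definition that the list above leaves out.

```javascript
// The standard Base64 alphabet: index 0..63 -> ASCII character.
const ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Hypothetical helper: encodes an array of byte values (0..255) as Base64.
function bytesToBase64(bytes) {
    let out = "";
    for (let i = 0; i < bytes.length; i += 3) {
        // a) Take up to three input bytes to get 24 bits (missing bytes count as 0).
        const hasSecond = i + 1 < bytes.length;
        const hasThird = i + 2 < bytes.length;
        const chunk = (bytes[i] << 16)
                    | ((hasSecond ? bytes[i + 1] : 0) << 8)
                    | (hasThird ? bytes[i + 2] : 0);
        // b) + c) Split into four 6-bit numbers and translate each via the table.
        out += ALPHABET[(chunk >> 18) & 63];
        out += ALPHABET[(chunk >> 12) & 63];
        // Padding: output "=" where no input byte contributed to the position.
        out += hasSecond ? ALPHABET[(chunk >> 6) & 63] : "=";
        out += hasThird ? ALPHABET[chunk & 63] : "=";
        // d) The loop continues with the next three bytes.
    }
    return out;
}

console.log(bytesToBase64([0x61, 0x62, 0x63])); // "YWJj" (the bytes of "abc")
console.log(bytesToBase64([0xE2, 0x82, 0xAC])); // "4oKs" (the UTF-8 bytes of "€")
console.log(bytesToBase64([0x61]));             // "YQ=="
```

Note that the function takes bytes, not text. Which bytes a given text turns into is decided in step 1, before Base64 gets involved.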

What does btoa do?

By now you might think: “This answer can't possibly be correct. It claims that it is not possible to directly encode text as Base64, even though this is exactly what btoa does: it takes text and spits out Base64.”

No. It does not take text and return Base64; it takes an argument of type string and returns Base64. But that string argument doesn't represent text, it is just a strange way to store a sequence of bytes. Each byte is represented by a character whose numerical code point value is equal to the value of the byte.[4]

A note in the HTML standard says that “the "b" can be considered to stand for "binary", and the "a" for "ASCII".” Contrary to popular opinion, I don't think that btoa is named badly. It does not take text, it takes binary data and produces an ASCII string using Base64, so a short form of “binary to ASCII” is an absolutely correct name. It's the argument type that is misleading.

The definition of btoa in the HTML standard simply says:

[...] the user agent must convert that argument to a sequence of octets whose nth octet is the eight-bit representation of the code point of the nth character of the argument, and then must apply the base64 algorithm to that sequence of octets, and return the result.
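
To make that definition a bit more tangible, here is a small sketch. The helpers binaryStringToBytes and btoaThrows are made up for this illustration; only btoa itself is a real API. It assumes an environment with a global btoa (browsers, or a recent Node.js).

```javascript
// Hypothetical helper: spells out the conversion the quoted definition describes,
// i.e. the nth byte is the code of the nth character.
function binaryStringToBytes(binaryString) {
    const bytes = [];
    for (let i = 0; i < binaryString.length; i++) {
        const code = binaryString.charCodeAt(i);
        if (code > 0xFF) {
            throw new RangeError("Character at index " + i + " does not fit into one byte");
        }
        bytes.push(code);
    }
    return bytes;
}

// Hypothetical helper: reports whether btoa rejects the given string.
function btoaThrows(str) {
    try {
        btoa(str);
        return false;
    } catch (e) {
        return true;
    }
}

console.log(binaryStringToBytes("\u00E2\u0082\u00AC")); // [226, 130, 172]
console.log(btoa("\u00E2\u0082\u00AC"));                // "4oKs"
console.log(btoaThrows("abc"));                         // false
console.log(btoaThrows("€"));                           // true: U+20AC is not a byte value
```

The last line is the reason you can't just pass arbitrary text to btoa: any character above U+00FF makes it throw.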

I don't know, and probably never will, why they didn't choose a different argument type, e.g. an array of numbers. Maybe the performance wasn't as good at the time when btoa was first specified?

What does unescape(encodeURIComponent(str)) do?

By now you could think: “If the first step in converting text to Base64 is encoding the text to bytes, then how is btoa(unescape(encodeURIComponent(str))) achieving that? btoa doesn't do that, but neither unescape nor encodeURIComponent seems to be in any way related to character encoding?”

Actually, encodeURIComponent is related to character encoding. The standard says:

The encodeURIComponent function computes a new [...] URI in which each instance of certain code points is replaced by [...] escape sequences representing the UTF-8 encoding of the code point.

So now we have the percent-encoded UTF-8 bytes. To convert the percent-encoded bytes to a binary string suitable for btoa, one can use unescape, because the behavior description states among other things:

  • If c is the code unit 0x0025 (PERCENT SIGN), then
    • [... how to decode %uXXXX ...]
    • Else if k ≤ length - 3 and [... two hexdigits follow ...] then
      • Set c to the code unit whose value is the integer represented by [...] the two hexadecimal digits at indices k + 1 and k + 2 within string.

Therefore, after encodeURIComponent has stored the UTF-8 bytes as %XX, unescape turns each of them into a single code point, exactly as required by btoa. So, all in all, btoa(unescape(encodeURIComponent(str))) encodes text to UTF-8 bytes, which are then encoded to Base64.
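
Here is the chain broken into its individual steps for a single character, as a small sketch. The variable names are only for illustration; it assumes an environment where btoa, atob, escape and unescape exist as globals (browsers, or a recent Node.js).

```javascript
const text = "€"; // U+20AC, three bytes in UTF-8: E2 82 AC

// Step 1a: UTF-8 encode, with every byte written as %XX.
const percentEncoded = encodeURIComponent(text);
console.log(percentEncoded);        // "%E2%82%AC"

// Step 1b: turn every %XX into one character with that code (a "binary string").
const binaryString = unescape(percentEncoded);
console.log(binaryString.length);   // 3, one character per UTF-8 byte

// Step 2: Base64-encode those bytes.
const base64 = btoa(binaryString);
console.log(base64);                // "4oKs"

// The reverse direction undoes the steps in the opposite order.
const roundTrip = decodeURIComponent(escape(atob(base64)));
console.log(roundTrip === text);    // true
```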

Back to the original question

In case you forgot, the question was:

(1) Why did the originally proposed solution include calls to escape() and unescape()? The solution was proposed prior to deprecation but presumably these functions added some kind of value at the time.

(2) Are there certain edge cases where my removal of these deprecated calls will cause my wrapper functions to fail?

Without unescape you don't get a Base64 representation of a UTF-8 encoded string. btoa(encodeURIComponent(str)) encodes text to some strange bytes (not a standardized Unicode Encoding Scheme, but the bytes one gets by storing a URI-encoded string as ASCII), which are then encoded as Base64. So unescape is necessary for standard conformance. (OK, encodeURIComponent and ASCII are also standardized, but nobody will expect that strange combination.)

If you are the only one converting to and from Base64, then yes, you could use btoa(encodeURIComponent(str)), and it will never throw an error, as explained in humanityANDpeace's detailed answer (I think question (2) is sufficiently answered there).

But in that case you would be better off using the result of encodeURIComponent directly. It is already pure ASCII and is always shorter than btoa(encodeURIComponent(str)). If you want a smaller size than encodeURIComponent(str), you can use btoa(unescape(encodeURIComponent(str))) (the more non-ASCII characters the input string contains, the bigger the saving).
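
A quick size comparison for the sample text from the snippet further down, just as an illustration (the variable names are made up for this example):

```javascript
const sample = "abc€😀";

const uriEncoded = encodeURIComponent(sample);    // "abc%E2%82%AC%F0%9F%98%80"
const base64OfUri = btoa(uriEncoded);             // Base64 of the percent-encoded ASCII text
const base64OfUtf8 = btoa(unescape(uriEncoded));  // Base64 of the UTF-8 bytes

console.log(uriEncoded.length);   // 24
console.log(base64OfUri.length);  // 32
console.log(base64OfUtf8.length); // 16
```

Base64 grows its input by a third, so wrapping the already long percent-encoded string in btoa only makes it longer.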

If you convert to Base64 because some other party, API or protocol expects Base64, then you simply cannot use btoa(encodeURIComponent(str)), because nobody else will understand the result.
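
To illustrate what "nobody understands the result" means, here is a sketch of a receiver that does the usual thing: decode the Base64 and interpret the bytes as UTF-8. The function receiverDecode is a made-up stand-in for such a party; it assumes atob and TextDecoder are available as globals.

```javascript
// Hypothetical receiver: Base64 -> bytes -> text, assuming UTF-8.
function receiverDecode(base64) {
    const binaryString = atob(base64);
    const bytes = Uint8Array.from(binaryString, (ch) => ch.charCodeAt(0));
    return new TextDecoder("utf-8").decode(bytes);
}

const message = "€";
const standard = btoa(unescape(encodeURIComponent(message)));
const nonStandard = btoa(encodeURIComponent(message));

console.log(receiverDecode(standard));    // "€"
console.log(receiverDecode(nonStandard)); // "%E2%82%AC", not the original text
```

The second result is valid Base64 and valid UTF-8, so the receiver won't even notice that something went wrong. It just ends up with the wrong text.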

Oh, and btoa(unescape(encodeURIComponent(str))) couldn't really have been “proposed prior to deprecation” of unescape: unescape was removed from the standard in version 3, the same version that added encodeURIComponent. unescape was still explained in the document, but was moved to Annex B.2, whose introduction stated that it “suggests uniform semantics [...] without making the properties or their semantics part of this standard.” But as browsers have to be backwards compatible, it probably won't be removed any time soon.
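
If you want to avoid escape and unescape altogether, one possible sketch is to do the "text to UTF-8 bytes" step with TextEncoder and TextDecoder instead. This is not part of the solution discussed in the question, and it assumes an environment that provides both as globals (current browsers and recent Node.js do). The function names are simply taken from the question.

```javascript
// Text -> UTF-8 bytes -> binary string -> Base64, without unescape.
function toBase64(str) {
    const bytes = new TextEncoder().encode(str); // always UTF-8
    let binaryString = "";
    for (const byte of bytes) {
        binaryString += String.fromCharCode(byte); // one character per byte, as btoa expects
    }
    return btoa(binaryString);
}

// Base64 -> binary string -> UTF-8 bytes -> text, without escape.
function fromBase64(base64) {
    const binaryString = atob(base64);
    const bytes = Uint8Array.from(binaryString, (ch) => ch.charCodeAt(0));
    return new TextDecoder("utf-8").decode(bytes);
}

console.log(toBase64("abc€😀"));                 // "YWJj4oKs8J+YgA=="
console.log(fromBase64("YWJj4oKs8J+YgA=="));     // "abc€😀"
```

For well-formed strings this should give the same Base64 as btoa(unescape(encodeURIComponent(str))). One difference I'm aware of: for a string containing a lone surrogate, encodeURIComponent throws, while TextEncoder substitutes U+FFFD.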


Try for yourself:

function run(){
    let Base64Function=new Function("str", $("#algorithm").val());
    let base64=Base64Function($("#input").val());
    $("#Base64Text").text("Output: "+base64);
    let charset=$('#charset').val();
    let uri="data:text/plain"
           +(charset?";charset="+charset:'')
           +($("#interpret").prop('checked')?";base64":'')
           +","+base64;
    $("#dataURI").text(uri);
    $("#dataURI").attr('href', uri);
    $("#Base64iframe").attr('src',uri);
}
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.3.1/jquery.min.js"></script>

<label for="input">Text to encode:</label>
<input type="text" id="input" value="abc€😀"/><br />

<label for="algorithm">Encode function:</label>
<input type="text" id="algorithm" size="50"/><br />

<button type="button" onclick="run();">Run</button>
Defaults:
<button type="button" onclick='
    $("#algorithm").val("return btoa(unescape(encodeURIComponent(str)))");
    $("#charset").val("UTF-8");
    $("#interpret").prop("checked",true);
'>UTF-8 Base64</button>
<button type="button" onclick='
    $("#algorithm").val("return btoa(encodeURIComponent(str))");
    $("#charset").val(""); //I don't know - it's not UTF-8
    $("#interpret").prop("checked",true);
'>wrong</button>
<button type="button" onclick='
    $("#algorithm").val("return encodeURIComponent(str)");
    $("#charset").val("UTF-8");
    $("#interpret").prop("checked",false);
'>without btoa (not Base64)</button>
<br />

<div id="Base64Text">Output:</div>

<label for="charset">Interpret as this character encoding:</label>
<input type="text" id="charset" /><br />

<label for="interpret">Interpret as Base64:</label>
<input type="checkbox" id="interpret" /><br />

<div><a id="dataURI"></a></div>
<iframe id="Base64iframe"></iframe>

This snippet tests the Base64 result by creating a data URI, but the concept applies to other applications of Base64 as well.


Note:

In some quotations I use [ and ] to leave out or shorten things that are unimportant in my opinion.
[... some text ...] is obviously not part of the source.

Footnotes:

[1] The standard says that Base64 “is designed to represent arbitrary sequences of octets” (an octet is a byte consisting of eight bits).

[2] A character set is not exactly the same as a character encoding. However, a coded character set can always be considered to implicitly define a character encoding, therefore "character set" and "character encoding" are often used as synonyms. Maybe it once was the same? Sometimes the term charset is explicitly used as a short term for character encoding and not for character set.

[3] At least UTF-8 is very dominant for websites. Also see UTF-8 Everywhere.

[4] This is effectively the ISO_8859-1 encoding, but I wouldn't think of it this way. It is better to think of it as bytes[i]==str.charCodeAt(i).

T S, answered Sep 16 '22 12:09