To escape a code point that is not in the Basic Multilingual Plane, the character is represented as a twelve-character sequence, encoding the UTF-16 surrogate pair. So for example, a string containing only the G clef character (U+1D11E) may be represented as
"\uD834\uDD1E"
.
ECMA-404: The JSON Data Interchange Format
I believe that there is no need to encode this character at all, so it could be represented directly as "๐"
. However, should one wish to encode it, it must, per spec, be encoded as "\uD834\uDD1E"
, not (as would seem reasonable) as "\u1d11e"
. Why is this?
(in ยง6) JSON may be represented using UTF-8, UTF-16, or UTF-32.
The surrogate code points are used in UTF-16 to represent code points beyond FFFF . They are used in pairs, so a character is made of 4 bytes. This mechanism is not needed in UTF-8, so text encoded with UTF-8 shouldn't contain them.
With surrogate pairs, a Unicode code point from range U+D800 to U+DBFF (called "high surrogate") gets combined with another Unicode code point from range U+DC00 to U+DFFF (called "low surrogate") to generate a whole new character, allowing the encoding of over one million additional characters.
Surrogate pair is a representation for a single abstract character that consists of a sequence of code units of two 16-bit code units, where the first value of the pair is a high-surrogate code unit and the second value is a low-surrogate code unit. An astral code point requires two code units โ a surrogate pair.
One of the key architectural features of JSON is that JSON-encoded objects are valid Javascript literals that can be evaluated using the eval
function, for example. Unfortunately, older Javascript implementations only support 16-bit Unicode escape sequences with four hex characters in string literals, so there's no other way than to use UTF-16 surrogates in escape sequences for code points above 0xFFFF in a portable way. (The \u{...}
syntax that allows arbitrary code points was only introduced in ECMAScript 6.)
But as you mentioned, there's no need to use escape sequences if your application supports Unicode JSON text. Simply encode the characters directly in the respective Unicode format.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With