Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why does JSON encode UTF-16 surrogate pairs instead of Unicode code points directly?

To escape a code point that is not in the Basic Multilingual Plane, the character is represented as a twelve-character sequence, encoding the UTF-16 surrogate pair. So for example, a string containing only the G clef character (U+1D11E) may be represented as "\uD834\uDD1E".

ECMA-404: The JSON Data Interchange Format

I believe that there is no need to encode this character at all, so it could be represented directly as "๐„ž". However, should one wish to encode it, it must, per spec, be encoded as "\uD834\uDD1E", not (as would seem reasonable) as "\u1d11e". Why is this?

like image 570
TRiG Avatar asked Jul 19 '16 15:07

TRiG


People also ask

Does JSON support UTF-16?

(in ยง6) JSON may be represented using UTF-8, UTF-16, or UTF-32.

What is surrogate encoding?

The surrogate code points are used in UTF-16 to represent code points beyond FFFF . They are used in pairs, so a character is made of 4 bytes. This mechanism is not needed in UTF-8, so text encoded with UTF-8 shouldn't contain them.

What is surrogate pair?

With surrogate pairs, a Unicode code point from range U+D800 to U+DBFF (called "high surrogate") gets combined with another Unicode code point from range U+DC00 to U+DFFF (called "low surrogate") to generate a whole new character, allowing the encoding of over one million additional characters.

What is surrogate pairs Javascript?

Surrogate pair is a representation for a single abstract character that consists of a sequence of code units of two 16-bit code units, where the first value of the pair is a high-surrogate code unit and the second value is a low-surrogate code unit. An astral code point requires two code units โ€” a surrogate pair.


1 Answers

One of the key architectural features of JSON is that JSON-encoded objects are valid Javascript literals that can be evaluated using the eval function, for example. Unfortunately, older Javascript implementations only support 16-bit Unicode escape sequences with four hex characters in string literals, so there's no other way than to use UTF-16 surrogates in escape sequences for code points above 0xFFFF in a portable way. (The \u{...} syntax that allows arbitrary code points was only introduced in ECMAScript 6.)

But as you mentioned, there's no need to use escape sequences if your application supports Unicode JSON text. Simply encode the characters directly in the respective Unicode format.

like image 61
nwellnhof Avatar answered Oct 21 '22 15:10

nwellnhof