
Encoding JSON in UTF-16 or UTF-32

The JSON RFC (RFC 4627), section 2.5, says in part:

To escape an extended character that is not in the Basic Multilingual Plane, the character is represented as a twelve-character sequence, encoding the UTF-16 surrogate pair. So, for example, a string containing only the G clef character (U+1D11E) may be represented as "\uD834\uDD1E".

Assume I have a valid reason to encode JSON as UTF-16BE (which is allowed). When doing so, is it still necessary to escape characters that are not in the Basic Multilingual Plane? E.g., instead of this:

00 5C 00 75 00 44 00 38 00 33 00 34 00 5C 00 75 00 44 00 44 00 31 00 45
  \     u     D     8     3     4     \     u     D     D     1     E

which is the 24-byte UTF-16BE byte sequence for \uD834\uDD1E, is it legal to do this:

D8 34 DD 1E

i.e., use the 4-byte UTF-16BE values directly?

Similarly, if I were to encode the same JSON string as UTF-32BE, could I simply use the code-point value directly:

00 01 D1 1E

?
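For reference, the byte sequences above can be reproduced with a short Python sketch. Python is an assumption here, since the question names no language, and the variable names are purely illustrative. The sketch encodes only the string content, without the surrounding quotation marks, using the standard `utf-16-be` and `utf-32-be` codecs.

```python
# Illustrative check of the byte sequences quoted in the question.
clef = "\U0001D11E"           # G clef, U+1D11E (outside the BMP)
escaped = r"\uD834\uDD1E"     # the twelve-character escape, as literal text

escaped_utf16 = escaped.encode("utf-16-be")   # 12 characters, 2 bytes each
direct_utf16 = clef.encode("utf-16-be")       # one surrogate pair
direct_utf32 = clef.encode("utf-32-be")       # one 4-byte code unit

# Print each length followed by the bytes in hex (bytes.hex(sep) needs Python 3.8+).
for label, data in (("escaped, UTF-16BE", escaped_utf16),
                    ("direct, UTF-16BE", direct_utf16),
                    ("direct, UTF-32BE", direct_utf32)):
    print(label, len(data), data.hex(" "))
```

This confirms only what the bytes are, not whether a given JSON parser accepts them.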

Paul J. Lucas asked Jul 25 '12 02:07


1 Answer

As far as I can tell, yes, you can write the UTF-16 values directly. The RFC paragraph you quoted only explains how to escape an arbitrary Unicode character once you have decided to escape it. Earlier in that same section, however, the RFC says:

All Unicode characters may be placed within the quotation marks except for the characters that must be escaped: quotation mark, reverse solidus, and the control characters (U+0000 through U+001F).

Any character may be escaped. If the character is in the Basic Multilingual Plane (U+0000 through U+FFFF), then it may be represented as a six-character sequence...

(Emphasis added.)

To me, this says that only ", \ and control characters must be escaped, and that any other Unicode characters may be placed as-is directly into the JSON text (in whatever UTF form you are using). It also says to me that even if you're encoding as UTF-8, you don't need to use the \uXXXX form for any Unicode character other than ", \, and control characters.
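As a rough check of this reading, here is a minimal sketch using Python's standard `json` module. The choice of language and parser is an assumption, and one parser's behaviour illustrates the point rather than proving what the RFC permits. It writes the same JSON string twice, once with the escape and once with the raw character, encodes each text as UTF-8, UTF-16BE and UTF-32BE, then decodes and parses the result.

```python
import json

clef = "\U0001D11E"  # G clef, U+1D11E

# Two JSON texts for the same one-character string.
escaped_text = r'"\uD834\uDD1E"'   # twelve-character escape inside quotes
raw_text = '"' + clef + '"'        # character placed as-is inside quotes

results = []
for encoding in ("utf-8", "utf-16-be", "utf-32-be"):
    for text in (escaped_text, raw_text):
        wire_bytes = text.encode(encoding)               # bytes as they would be stored or sent
        parsed = json.loads(wire_bytes.decode(encoding))
        results.append(parsed == clef)                   # did we get the G clef back?

print(all(results))

# Python's serializer emits the raw character unless ASCII-only output is requested.
print(json.dumps(clef, ensure_ascii=False) == raw_text)
print(json.dumps(clef, ensure_ascii=True).lower() == escaped_text.lower())
```

With this parser, both forms yield the same one-character string in every encoding, which is consistent with the escape being optional for characters other than ", \ and control characters.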

(As an aside, this does make me wonder whether the \uXXXX form is actually useful for anything other than control characters. As the other poster said, it probably comes down to what your JSON parser actually supports.)

Chris Hillery answered Oct 08 '22 19:10
