
Control characters in JSON string

The JSON specification states that the only control characters that must be escaped are those with code points from U+0000 to U+001F:

7.  Strings

   The representation of strings is similar to conventions used in the C
   family of programming languages.  A string begins and ends with
   quotation marks.  All Unicode characters may be placed within the
   quotation marks, except for the characters that must be escaped:
   quotation mark, reverse solidus, and the control characters (U+0000
   through U+001F).

The main idea of escaping is to avoid damaging the output when a JSON document or message is printed on a terminal or on paper.

But there are other control characters, such as DEL (U+007F), which is usually grouped with the C0 set, and the characters of the C1 set (U+0080 through U+009F). Shouldn't they also be escaped in JSON strings?
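To make the question concrete, here is a minimal sketch of the quoted rule, checked against Python's standard `json` parser. The helper `must_escape` and the sample characters are illustrative assumptions, not part of the specification, and other parsers may be stricter or looser.

```python
import json

def must_escape(ch: str) -> bool:
    """True if the quoted rule requires ch to be escaped inside a JSON string."""
    return ch in '"\\' or ord(ch) <= 0x1F

# Hypothetical samples: one C0 control, DEL, and one C1 control (NEL).
samples = {"U+0001": "\x01", "U+007F": "\x7f", "U+0085": "\x85"}

def parser_accepts_raw(ch: str) -> bool:
    """Place ch unescaped between quotes and see whether json.loads accepts it."""
    try:
        json.loads('"' + ch + '"')
        return True
    except json.JSONDecodeError:
        return False

for name, ch in samples.items():
    print(name, "must escape:", must_escape(ch),
          "| accepted raw:", parser_accepts_raw(ch))
```

With the default strict mode, the parser rejects the raw U+0001 but accepts raw DEL and raw C1 characters, which matches the rule as written.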

Andriy Plokhotnyuk Avatar asked Dec 02 '17 12:12



1 Answer

From the JSON specification:

8.  String and Character Issues

8.1.  Character Encoding

   JSON text SHALL be encoded in UTF-8, UTF-16, or UTF-32.

In UTF-8, every code point above 127 is encoded as multiple bytes. The continuation bytes of those sequences fall in the range 0x80 to 0xBF, and half of that range (0x80 to 0x9F) coincides with the C1 control byte values. To keep such bytes out of a UTF-8 encoded JSON string, a large share of non-ASCII code points would have to be escaped. That effectively eliminates the benefit of UTF-8, and the JSON text might as well be encoded in ASCII. Since ASCII is a subset of UTF-8, the standard does not disallow this. So if you are concerned about C1 control bytes in the byte stream, just escape them, but requiring every JSON representation to use ASCII would be wildly inefficient in anything but an English-language environment.
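As a rough illustration, the sketch below uses Python's standard `json` module and the euro sign as an arbitrary sample. It shows one byte in the 0x80 to 0x9F range appearing in the raw UTF-8 form, and the size cost of the ASCII-only form. The helper name `c1_range_bytes` is made up for this example.

```python
import json

text = "€"  # U+20AC, an arbitrary non-ASCII sample

def c1_range_bytes(data: bytes) -> list[int]:
    """Return the bytes whose values fall in 0x80..0x9F."""
    return [b for b in data if 0x80 <= b <= 0x9F]

# Raw form: the character is written as-is and the text is encoded as UTF-8.
raw = json.dumps(text, ensure_ascii=False).encode("utf-8")

# Escaped form: every non-ASCII character becomes a \uXXXX escape.
escaped = json.dumps(text, ensure_ascii=True).encode("ascii")

print(raw, [hex(b) for b in c1_range_bytes(raw)])
print(escaped, [hex(b) for b in c1_range_bytes(escaped)])
print(len(raw), "bytes raw vs", len(escaped), "bytes escaped")
```

Both forms decode to the same string, so escaping is a choice the producer can make without affecting consumers. The cost is size: here one character takes three bytes raw and six bytes escaped.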

UTF-16 and UTF-32 text cannot be handled at all by a byte-oriented consumer that interprets C0 or C1 control bytes, because even plain ASCII characters carry 0x00 bytes in those encodings. The point is therefore moot for them.
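A small sketch of why this is so, assuming a tiny hypothetical document and Python's built-in codecs: the same ASCII-only JSON text is encoded three ways, and the bytes in the C0 range (0x00 to 0x1F) are counted.

```python
import json

doc = json.dumps({"a": 1})  # hypothetical document; ASCII-only JSON text

def count_c0_bytes(data: bytes) -> int:
    """Count bytes whose values fall in the C0 range 0x00..0x1F."""
    return sum(1 for b in data if b <= 0x1F)

encodings = {
    "utf-8": doc.encode("utf-8"),
    "utf-16-le": doc.encode("utf-16-le"),  # little-endian, no byte order mark
    "utf-32-le": doc.encode("utf-32-le"),
}

for name, data in encodings.items():
    print(name, len(data), "bytes,", count_c0_bytes(data), "in the C0 range")
```

The UTF-8 form contains no such bytes, while the UTF-16 and UTF-32 forms are padded with NUL bytes for every ASCII character, regardless of how the JSON strings were escaped.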

Rick Avatar answered Oct 06 '22 15:10
