The JSON spec allows for escaped Unicode in JSON strings (of the form \uXXXX). It specifically mentions a restricted codepoint (a noncharacter) as a valid escaped codepoint. Doesn't this imply that parsers must generate illegal Unicode when decoding strings containing noncharacters and restricted codepoints?
An example:
{ "key": "\uFDD0" }
Decoding this either requires that your parser make no attempt to interpret the escaped codepoint, or that it generate an invalid Unicode string, does it not?
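For what it's worth (a quick illustration in JavaScript; other parsers may behave differently), the built-in JSON.parse accepts the escape without complaint and returns a string containing the noncharacter verbatim:

const parsed = JSON.parse('{ "key": "\\uFDD0" }');

// No error is raised; the decoded value contains the noncharacter U+FDD0.
console.log(parsed.key.length);                      // 1
console.log(parsed.key.codePointAt(0).toString(16)); // "fdd0"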
The JSON specification states that JSON strings can contain escaped Unicode characters, in the form of: "here comes a unicode character: \u05d9 !"
JSON text SHALL be encoded in Unicode. The default encoding is UTF-8.
JSON may be represented using UTF-8, UTF-16, or UTF-32. When JSON is written in UTF-8, JSON is 8bit compatible. When JSON is written in UTF-16 or UTF-32, the binary content-transfer-encoding must be used. (in §6)
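A quick check in JavaScript (my own illustration, not taken from the spec) shows that the escaped form and the literal character decode to the same string:

// Escaped "\u05d9" and the literal character both parse to the same one-character string.
JSON.parse('"\\u05d9"') === JSON.parse('"\u05d9"'); // true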
When you decode, it seems that this would be an appropriate use for the Unicode replacement character, U+FFFD.
From the Unicode Character Database, the annotation for U+FFFD REPLACEMENT CHARACTER reads: "used to replace an incoming character whose value is unknown or unrepresentable in Unicode".
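A minimal sketch of that approach in JavaScript (the regular expression and the sanitize name are mine, not anything from the spec; it matches the BMP noncharacters U+FDD0..U+FDEF and U+FFFE/U+FFFF, plus U+xFFFE/U+xFFFF in each supplementary plane via surrogate pairs):

// Replace Unicode noncharacters with U+FFFD while decoding JSON.
const NONCHARACTERS =
  /[\uFDD0-\uFDEF\uFFFE\uFFFF]|[\uD83F\uD87F\uD8BF\uD8FF\uD93F\uD97F\uD9BF\uD9FF\uDA3F\uDA7F\uDABF\uDAFF\uDB3F\uDB7F\uDBBF\uDBFF][\uDFFE\uDFFF]/g;

function sanitize(jsonText) {
  // JSON.parse's reviver lets us post-process every decoded string value.
  return JSON.parse(jsonText, (key, value) =>
    typeof value === 'string' ? value.replace(NONCHARACTERS, '\uFFFD') : value
  );
}

sanitize('{ "key": "\\uFDD0" }').key; // "\uFFFD"

Note that the reviver only sees values, so noncharacters appearing in property names would need separate handling.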
What do you mean by “restricted codepoint”? What spec are you looking at that uses that language? (I can't find any such.)
If you are talking about the surrogates, then yes: JavaScript knows almost nothing(*) about surrogates and treats any sequence of UTF-16 code units as valid. JSON, being limited to what JavaScript supports, does the same.
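To illustrate (my own example): parsing an escaped lone surrogate succeeds and yields a one-code-unit string, with no attempt to validate pairing:

// A lone high surrogate parses without error; JSON, like JavaScript strings,
// does not require well-formed surrogate pairs.
const s = JSON.parse('"\\ud834"');
console.log(s.length);                     // 1
console.log(s.charCodeAt(0).toString(16)); // "d834"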
*: the only part of JS I can think of that does anything special with surrogates is the encodeURIComponent function, since it uses UTF-8 encoding, in which an invalid surrogate sequence cannot be represented. If you try to:
encodeURIComponent('\ud834\udd1e'.substring(0, 1))
you will get an exception.
(Gah! SO seems not to allow characters from outside the Basic Multilingual Plane to be posted directly. Tsk.)