RFC 4627 on JSON reads:
Encoding
JSON text SHALL be encoded in Unicode. The default encoding is UTF-8.
Since the first two characters of a JSON text will always be ASCII characters [RFC0020], it is possible to determine whether an octet stream is UTF-8, UTF-16 (BE or LE), or UTF-32 (BE or LE) by looking at the pattern of nulls in the first four octets.
What does "Since the first two characters of a JSON text will always be ASCII characters [RFC0020]" mean? I've looked at RFC0020 but couldn't find anything about it. A JSON text could start with {" or { " (i.e. with whitespace before the quote).
Since JSON can represent any Unicode character with an escape sequence of the form \uXXXX, a JSON text can always be encoded in pure ASCII.
All Unicode characters may be placed within the quotation marks except for the characters that must be escaped: quotation mark, reverse solidus, and the control characters (U+0000 through U+001F). Any character may be escaped.
JSON text SHALL be encoded in Unicode. The default encoding is UTF-8.
The JSON specification states that JSON strings can contain Unicode characters in escaped form, for example: "here comes a unicode character: \u05d9 !"
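As a small illustration of the escaping described above, here is a sketch using Python's standard json module. The variable names are made up for this example; ensure_ascii=True is the module's default and is only spelled out here for clarity.

```python
import json

# A string containing a non-ASCII character (U+05D9, Hebrew letter yod).
text = "here comes a unicode character: \u05d9 !"

# With ensure_ascii=True, every non-ASCII character is written as a
# \uXXXX escape, so the resulting JSON text is pure ASCII.
encoded = json.dumps(text, ensure_ascii=True)
print(encoded)

# Parsing the ASCII-only JSON text gives back the original string.
decoded = json.loads(encoded)
print(decoded == text)
```

Characters outside the Basic Multilingual Plane are written as a pair of \uXXXX escapes (a UTF-16 surrogate pair), so the same approach still produces ASCII-only output for them.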
It means that since JSON will always start with ASCII characters (non-ASCII is only permitted in strings, which cannot be the root object), it is possible to determine from the start of the stream/file what encoding it is in.
UTF-16 and UTF-32 streams may begin with a BOM, and if one is present it identifies the exact encoding. Even without a BOM, detection is still possible: because the first two characters are known to be ASCII, the positions of the null octets among the first four octets reveal both the code unit width and the byte order.
I assume the spec mentions this specifically because the same trick does not work for most other text streams/files: they can start with any two characters, so their first bytes are not known in advance.
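The null-pattern check from the RFC can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name detect_json_encoding is made up, the input is assumed to be a BOM-less JSON text as defined by RFC 4627 (an object or array at the top level), and inputs shorter than four octets are assumed to be UTF-8, since "{}" already takes four octets in UTF-16.

```python
def detect_json_encoding(data: bytes) -> str:
    """Guess the encoding of a BOM-less JSON text from its first four octets."""
    if len(data) < 4:
        # Assumption: fewer than four octets can only be UTF-8,
        # because the shortest UTF-16 JSON text ("{}") needs four.
        return "utf-8"

    # True where the octet is a null byte, False otherwise.
    nulls = tuple(octet == 0 for octet in data[:4])

    # Null patterns as listed in RFC 4627, section 3.
    patterns = {
        (True, True, True, False): "utf-32-be",    # 00 00 00 xx
        (True, False, True, False): "utf-16-be",   # 00 xx 00 xx
        (False, True, True, True): "utf-32-le",    # xx 00 00 00
        (False, True, False, True): "utf-16-le",   # xx 00 xx 00
        (False, False, False, False): "utf-8",     # xx xx xx xx
    }
    try:
        return patterns[nulls]
    except KeyError:
        raise ValueError("null pattern does not match any JSON encoding")


sample = '{"key": "value"}'
for codec in ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be"):
    print(codec, "->", detect_json_encoding(sample.encode(codec)))
```

The sketch does not look for a BOM. A stream that starts with one would need the BOM checked and stripped before the null pattern is examined.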