I was reading RFC 4627 and I can't figure out if the following is valid JSON or not. Consider this minimalistic JSON text:
["\u005c"]
The problem is the lowercase c.
According to the text of the RFC it is allowed:
Any character may be escaped. If the character is in the Basic Multilingual Plane (U+0000 through U+FFFF), then it may be represented as a six-character sequence: a reverse solidus, followed by the lowercase letter u, followed by four hexadecimal digits that encode the character's code point. The hexadecimal letters A though F can be upper or lowercase. So, for example, a string containing only a single reverse solidus character may be represented as "\u005C".
(Emphasis mine)
The problem is that the RFC also contains the grammar for this:
char = unescaped /
       escape (
           %x22 /          ; "    quotation mark  U+0022
           %x5C /          ; \    reverse solidus U+005C
           %x2F /          ; /    solidus         U+002F
           %x62 /          ; b    backspace       U+0008
           %x66 /          ; f    form feed       U+000C
           %x6E /          ; n    line feed       U+000A
           %x72 /          ; r    carriage return U+000D
           %x74 /          ; t    tab             U+0009
           %x75 4HEXDIG )  ; uXXXX                U+XXXX
where HEXDIG is defined in the referenced RFC 4234 as
HEXDIG = DIGIT / "A" / "B" / "C" / "D" / "E" / "F"
which includes only uppercase letters.
FWIW, from what I researched most JSON parsers accept both upper and lowercase letters.
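As a quick check of that observation, here is a minimal sketch using Python's standard json module. It tests one parser only, so it illustrates the claim rather than proving anything about "most" parsers.

```python
import json

# The same JSON text, once with a lowercase and once with an uppercase hex letter.
# Raw strings keep the backslash so the JSON parser sees the \u escape itself.
lower = r'["\u005c"]'
upper = r'["\u005C"]'

decoded_lower = json.loads(lower)
decoded_upper = json.loads(upper)

# Both decode to a one-element list holding a single reverse solidus.
print(decoded_lower == decoded_upper == ["\\"])
```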
Question(s): What is actually correct? Is there a contradiction, and should the grammar in the RFC be fixed?
In JSON, the only characters you must escape are \, ", and control codes. Thus, in order to escape your structure, you'll need a JSON-specific function. As you might know, every escape can also be written as \uXXXX, where XXXX is the UTF-16 code unit for that character.
A Unicode escape sequence is a backslash followed by the letter 'u' followed by four hexadecimal digits (0-9a-fA-F). It matches a character in the target sequence with the value specified by the four digits. For example, "\u0041" matches the target sequence "A" when the ASCII character encoding is used.
An escape sequence contains a backslash (\) symbol followed by one of the escape sequence characters or an octal or hexadecimal number. A hexadecimal escape sequence contains an x followed by one or more hexadecimal digits (0-9, A-F, a-f).
I think it's explained by this part of RFC 4234:
ABNF strings are case-insensitive and the character set for these strings is us-ascii.
Hence:
rulename = "abc"
and:
rulename = "aBc"
will match "abc", "Abc", "aBc", "abC", "ABc", "aBC", "AbC", and "ABC".
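To make those eight variants concrete, a small Python sketch can enumerate every upper/lowercase combination of a quoted literal. The helper name case_variants is made up for this illustration and is not part of any ABNF tooling.

```python
from itertools import product

def case_variants(literal):
    """Return every string a quoted, case-insensitive ABNF literal would match."""
    # For each character, allow both its lowercase and uppercase form.
    choices = [{ch.lower(), ch.upper()} for ch in literal]
    return {"".join(combo) for combo in product(*choices)}

variants = case_variants("aBc")
print(len(variants), sorted(variants))
```

Starting from "abc" or "aBc" gives the same set, which is the point of the quoted passage: the case used in the rule definition does not matter.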
On the other hand, the follow-on part is not terribly clear:
To specify a rule that IS case SENSITIVE, specify the characters individually.
For example:
rulename = %d97 %d98 %d99
or
rulename = %d97.98.99
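The numeric form pins down exact character values, which is why it is case-sensitive. A rough sketch, assuming a hypothetical helper decode_num_val that handles only the %d and %x forms (no %b, no value ranges):

```python
def decode_num_val(term):
    """Decode an ABNF numeric terminal such as %d97.98.99 or %x41 into text."""
    bases = {"d": 10, "x": 16}
    base = bases[term[1].lower()]
    # Dots concatenate several values into one exact character sequence.
    return "".join(chr(int(part, base)) for part in term[2:].split("."))

# Space-separated terminals form a concatenation of single characters.
spelled_out = "".join(decode_num_val(t) for t in "%d97 %d98 %d99".split())
dotted = decode_num_val("%d97.98.99")
print(spelled_out, dotted)
```

Both forms decode to the lowercase string only, so neither would match "ABC".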
In the case of the HEXDIG rule, they're individual characters to start with - but they're specified literally as "A" etc. rather than as %x41, so I suspect that means they're case-insensitive. It's not the clearest spec I've read :(
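The two readings of HEXDIG can be compared with a pair of regular expressions. This is only a sketch: the pattern names are mine, and it checks just the \uXXXX escape form, not the full char rule.

```python
import re

# HEXDIG read as case-insensitive quoted strings, per the RFC 4234 text quoted above.
ESCAPE_INSENSITIVE = re.compile(r"\\u[0-9A-Fa-f]{4}")
# HEXDIG read literally, as uppercase letters only.
ESCAPE_UPPER_ONLY = re.compile(r"\\u[0-9A-F]{4}")

for escape in (r"\u005c", r"\u005C"):
    print(
        escape,
        bool(ESCAPE_INSENSITIVE.fullmatch(escape)),
        bool(ESCAPE_UPPER_ONLY.fullmatch(escape)),
    )
```

Only the case-insensitive reading accepts the lowercase form from the question, which agrees with the prose of RFC 4627.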