
JSON Unicode escape sequence - lowercase or not?

I was reading RFC 4627 and I can't figure out if the following is valid JSON or not. Consider this minimalistic JSON text:

["\u005c"]

The problem is the lowercase c.

According to the text of the RFC it is allowed:

Any character may be escaped. If the character is in the Basic Multilingual Plane (U+0000 through U+FFFF), then it may be represented as a six-character sequence: a reverse solidus, followed by the lowercase letter u, followed by four hexadecimal digits that encode the character's code point. The hexadecimal letters A though F can be upper or lowercase. So, for example, a string containing only a single reverse solidus character may be represented as "\u005C".

(Emphasis mine)

The problem is that the RFC also contains the grammar for this:

char = unescaped /
       escape (
           %x22 /          ; "    quotation mark  U+0022
           %x5C /          ; \    reverse solidus U+005C
           %x2F /          ; /    solidus         U+002F
           %x62 /          ; b    backspace       U+0008
           %x66 /          ; f    form feed       U+000C
           %x6E /          ; n    line feed       U+000A
           %x72 /          ; r    carriage return U+000D
           %x74 /          ; t    tab             U+0009
           %x75 4HEXDIG )  ; uXXXX                U+XXXX

where HEXDIG is defined in referenced RFC 4234 as

HEXDIG         =  DIGIT / "A" / "B" / "C" / "D" / "E" / "F"

which includes only uppercase letters.

FWIW, from what I researched most JSON parsers accept both upper and lowercase letters.
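As a quick illustration of that observation, here is a minimal check against a single parser, Python's standard-library `json` module. The choice of parser is only an assumption for illustration; it shows what one implementation does, not what the RFC requires.

```python
import json

# Raw strings keep the backslash, so the parser sees the \uXXXX escape itself.
upper = json.loads(r'["\u005C"]')   # uppercase hex digit, as in the RFC's example
lower = json.loads(r'["\u005c"]')   # lowercase hex digit, the case in question

# Both decode to a list holding a single reverse solidus character.
print(upper == lower == ["\\"])     # prints True
```

A parser that followed the strict, uppercase-only reading of the grammar would have to reject the second input instead.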

Question(s): What is actually correct? Is there a contradiction and the grammar in the RFC should be fixed?

Daniel Frey asked Jun 13 '14 22:06


1 Answer

I think it's explained by this part of RFC 4234:

ABNF strings are case-insensitive and the character set for these strings is us-ascii.

Hence:

    rulename = "abc"

and:

    rulename = "aBc"

will match "abc", "Abc", "aBc", "abC", "ABc", "aBC", "AbC", and "ABC".

On the other hand, the follow-on part is not terribly clear:

To specify a rule that IS case SENSITIVE, specify the characters individually.

For example:

    rulename    =  %d97 %d98 %d99

or

    rulename    =  %d97.98.99

In the case of the HEXDIG rule, they're individual characters to start with - but they're specified literally as "A" etc. rather than as numeric values such as %x41, so I suspect that means they're case-insensitive. It's not the clearest spec I've read :(
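To make that reading concrete, here is a minimal sketch of the two kinds of ABNF terminal, assuming the interpretation above is right. The helper names (`ascii_fold`, `match_quoted_literal`, `match_numeric_value`, `is_hexdig`) are hypothetical and not taken from any RFC or library.

```python
def ascii_fold(text):
    """Lowercase only the US-ASCII letters A-Z, leaving everything else alone."""
    return "".join(chr(ord(ch) + 32) if "A" <= ch <= "Z" else ch for ch in text)

def match_quoted_literal(literal, text):
    """Quoted ABNF string such as "abc": compared case-insensitively."""
    return ascii_fold(literal) == ascii_fold(text)

def match_numeric_value(code_points, text):
    """Numeric ABNF terminal such as %d97.98.99: exact code points, case-sensitive."""
    return [ord(ch) for ch in text] == list(code_points)

# HEXDIG = DIGIT / "A" / "B" / "C" / "D" / "E" / "F"  (quoted literals)
HEXDIG_LETTERS = ("A", "B", "C", "D", "E", "F")

def is_hexdig(ch):
    """HEXDIG under the assumption that quoted letters match either case."""
    if len(ch) != 1:
        return False
    return ch in "0123456789" or any(
        match_quoted_literal(letter, ch) for letter in HEXDIG_LETTERS
    )

print(match_quoted_literal("aBc", "ABC"))         # True: quoted string
print(match_numeric_value((97, 98, 99), "aBc"))   # False: only "abc" matches
print(is_hexdig("c"), is_hexdig("C"), is_hexdig("g"))  # True True False
```

Under this reading the lowercase c in "\u005c" satisfies HEXDIG, which agrees with the prose of RFC 4627.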

Jon Skeet answered Oct 10 '22 23:10