JQ can't parse a Unicode emoji character. Is it valid JSON?

Question

I have a JSON file produced by Clojure's data.json library. The data came from Twitter, where people seem to smile a lot.

$ cat /tmp/myfile | jq .

I get:

parse error: Invalid \uXXXX\uXXXX surrogate pair escape at line 1, column 14862268

The offending section is:

$ cut -c 14862258-14862269 /tmp/2017-02-23-2
79-7\ud83d",

So this escape code was found in a real JSON file, and JQ can't read it. A minimal test case:

echo '"\ud83d"' | jq .

Fileformat.info seems to suggest that it should come in a pair:

SMILING FACE WITH OPEN MOUTH
"\uD83D\uDE03"

Is this truly an invalid character to find in a JSON file? Is my JSON technically invalid?
Is there a simple utility I can pipe the data through to strip out these characters prior to JQ? Or can I make JQ relax its interpretation?

peak · Accepted Answer

The JSON specification says:

A string is a sequence of zero or more Unicode characters [UNICODE].

In this sense, the string "\ud83d" is NOT valid JSON ("U+D83D is not a valid Unicode character"), even though it conforms with the JSON ABNF. As the standards document goes on to say, there is a discrepancy between the string specification and the ABNF:

the ABNF in this specification allows member names and string values to contain bit sequences that cannot encode Unicode characters; for example, "\uDEAD" (a single unpaired UTF-16 surrogate). Instances of this have been observed, for example, when a library truncates a UTF-16 string without checking whether the truncation split a surrogate pair. The behavior of software that receives JSON texts containing such values is unpredictable ...
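The truncation scenario described in that passage can be reproduced with a short sketch. Python is used here purely for illustration (the data in the question came from Clojure), the helper name truncate_utf16 is hypothetical, and the sample text only borrows the "79-7" prefix from the question; nothing here claims that this is how the asker's file was actually produced.

```python
import json


def truncate_utf16(text, max_units):
    """Cut text to max_units UTF-16 code units without checking surrogate pairs."""
    raw = text.encode("utf-16-le")      # two bytes per UTF-16 code unit
    cut = raw[: 2 * max_units]          # naive cut that may split a surrogate pair
    # "surrogatepass" keeps a dangling surrogate instead of raising an error
    return cut.decode("utf-16-le", errors="surrogatepass")


tweet = "79-7\U0001F603"                # U+1F603 occupies two UTF-16 code units
broken = truncate_utf16(tweet, 5)       # keeps "79-7" plus the high surrogate only
print(json.dumps(broken))               # "79-7\ud83d"
```

The serializer happily writes the dangling high surrogate as an escape, which is the kind of output a stricter reader then refuses.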

So it would be fair to say that:

"\uD83D" is not strictly valid JSON, even though it conforms to the ABNF;
jq is within its rights here;
jsonlint is wrong to accept "\uD83D".
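The gap between "matches the ABNF" and "is a sequence of Unicode characters" can be seen with a parser that happens to be lenient. The sketch below uses Python's standard json module, which accepts a lone surrogate escape; the function name classify is an assumption made for this example, and checking whether the result can be encoded as UTF-8 is just one convenient way to detect the problem.

```python
import json


def classify(json_string_literal):
    """Parse a JSON string literal and report whether it holds a lone surrogate."""
    value = json.loads(json_string_literal)   # Python's parser accepts "\ud83d"
    try:
        value.encode("utf-8")                 # lone surrogates cannot be encoded
    except UnicodeEncodeError:
        return "lone surrogate"
    return "ok"


print(classify(r'"\ud83d"'))                  # lone surrogate
print(classify(r'"\uD83D\uDE03"'))            # ok
```

The properly paired escape decodes to the single character U+1F603, while the unpaired one yields a value that cannot be written out as UTF-8. Different parsers draw the line in different places, which is the unpredictability the standard warns about.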

“... strip out these characters”

See e.g. "How to remove non UTF-8 characters from text file".
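Because the offending sequence is an ASCII escape (\ud83d) rather than a raw invalid byte, a filter that works on the escapes themselves may be closer to what the question asks for. The following is a minimal sketch, not a vetted tool: the names replace_lone_surrogates and filter_stream are hypothetical, and substituting \ufffd (the Unicode replacement character) instead of deleting the escape is a choice made for this example.

```python
import re

# Tried in order at each backslash: a valid high+low pair, an unpaired
# surrogate escape, or any other escape (so an escaped backslash is skipped whole).
_ESCAPE = re.compile(
    r"\\u[dD][89abAB][0-9a-fA-F]{2}\\u[dD][c-fC-F][0-9a-fA-F]{2}"
    r"|(?P<lone>\\u[dD][89a-fA-F][0-9a-fA-F]{2})"
    r"|\\.",
    re.DOTALL,
)


def replace_lone_surrogates(json_text, replacement=r"\ufffd"):
    """Replace unpaired \\uD800-\\uDFFF escapes, leaving valid pairs untouched."""
    def fix(match):
        return replacement if match.group("lone") else match.group(0)
    return _ESCAPE.sub(fix, json_text)


def filter_stream(src, dst):
    """Copy src to dst line by line, repairing lone surrogate escapes."""
    for line in src:
        dst.write(replace_lone_surrogates(line))
```

Saved as a script that imports sys and ends with a call to filter_stream(sys.stdin, sys.stdout), this could sit between cat and jq in the pipeline from the question. It assumes the input is otherwise well-formed JSON, where backslashes occur only inside strings.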

Tags: json, unicode, standards, clojure, jq

Joe
