I have a JSON file produced by Clojure's data.json library. The data
came from Twitter, where people seem to smile a lot. When I run:
$ cat /tmp/myfile | jq .
I get:
parse error: Invalid \uXXXX\uXXXX surrogate pair escape at line 1, column 14862268
The offending section is:
$ cut -c 14862258-14862269 /tmp/2017-02-23-2
79-7\ud83d",
So this escape sequence appears in a real JSON file, and jq can't read it. A minimal input that exercises the same escape:
$ echo '"\ud83d"' | jq .
Fileformat.info seems to suggest that it should come in a pair:
SMILING FACE WITH OPEN MOUTH
"\uD83D\uDE03"
Is this truly an invalid character to find in a JSON file? Is my JSON technically invalid?
Is there a simple utility I can pipe the data through to strip out these characters before jq sees them? Or can I make jq relax its interpretation?
The JSON specification says:
A string is a sequence of zero or more Unicode characters [UNICODE].
In this sense, the string "\ud83d" is NOT valid JSON ("U+D83D is not a valid Unicode character"), even though it conforms to the JSON ABNF. As the standards document goes on to say, there is a discrepancy between the string specification and the ABNF:
the ABNF in this specification allows member names and string values to contain bit sequences that cannot encode Unicode characters; for example, "\uDEAD" (a single unpaired UTF-16 surrogate). Instances of this have been observed, for example, when a library truncates a UTF-16 string without checking whether the truncation split a surrogate pair. The behavior of software that receives JSON texts containing such values is unpredictable ...
So it would be fair to say that:
"\uD83D" is not strictly valid JSON, even though it conforms to the ABNF;
jq is within its rights here;
jsonlint is wrong to accept "\uD83D".
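To make the pairing rule concrete, here is a small sketch of the standard UTF-16 arithmetic that turns a high and a low surrogate into one code point. The function name `combine_surrogates` is my own, used only for illustration. It shows why "\uD83D\uDE03" is a character while "\uD83D" on its own is only half of one.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into a single code point."""
    # A high surrogate lies in D800..DBFF, a low surrogate in DC00..DFFF.
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError(f"{high:#06x} is not a high surrogate")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError(f"{low:#06x} is not a low surrogate")
    # Each half carries 10 bits; the result is offset past the BMP.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


code_point = combine_surrogates(0xD83D, 0xDE03)
print(f"U+{code_point:X}")  # U+1F603, SMILING FACE WITH OPEN MOUTH
```

With only 0xD83D available there is no second argument to supply, which is exactly the situation jq is complaining about.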
As for stripping such characters before the data reaches jq, see e.g. "How to remove non UTF-8 characters from text file".
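If you would rather repair the escapes than strip bytes, the following is a minimal sketch of a pre-filter, assuming the problem is limited to unpaired \uXXXX surrogate escapes in the raw JSON text. The name `replace_lone_surrogates` and the choice of \ufffd (the Unicode replacement character) as the substitute are my own assumptions, not anything jq or data.json prescribes. The regex consumes well-formed pairs and all other backslash escapes first, so an escaped backslash followed by the letters "ud83d" is left alone.

```python
import re

# Alternatives are tried in order at each backslash:
#   pair  - a high surrogate escape immediately followed by a low one
#   lone  - any other surrogate escape (unpaired high or low half)
#   other - any other escape, e.g. \\ or \n, consumed so it is skipped
SURROGATE_ESCAPES = re.compile(
    r"(?P<pair>\\u[dD][89abAB][0-9a-fA-F]{2}\\u[dD][c-fC-F][0-9a-fA-F]{2})"
    r"|(?P<lone>\\u[dD][89a-fA-F][0-9a-fA-F]{2})"
    r"|(?P<other>\\.)",
    re.DOTALL,
)


def replace_lone_surrogates(text: str, replacement: str = r"\ufffd") -> str:
    """Replace unpaired surrogate escapes in raw JSON text."""
    def fix(match: re.Match) -> str:
        # Only lone surrogates are rewritten; everything else is kept as-is.
        return replacement if match.group("lone") else match.group(0)

    return SURROGATE_ESCAPES.sub(fix, text)


sample = r'"79-7\ud83d", "\uD83D\uDE03"'
print(replace_lone_surrogates(sample))
# "79-7\ufffd", "\uD83D\uDE03"
```

To use it as a filter in front of jq, one option is to pass each line of `sys.stdin` through `replace_lone_surrogates` and write the result to `sys.stdout`. Passing `replacement=""` drops the broken escape instead of substituting for it.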