I'm writing a JSON parser in C++ and am facing a problem when parsing JSON strings:
The JSON specification states that JSON strings can contain unicode characters in the form of:
"here comes a unicode character: \u05d9 !"
My JSON parser tries to map JSON strings to std::string, so usually one character of the JSON string becomes one character of the std::string. However, for those Unicode characters, I really don't know what to do:
Should I just put the raw byte values in my std::string, like so:
std::string mystr;
mystr.push_back('\x05');
mystr.push_back('\xd9');
Or should I interpret these two bytes with a library like iconv and store the UTF-8 encoded result in my string instead?
Should I use a std::wstring to store all the characters? What then on *NIX OSes, where wchar_t is 4 bytes long?
I sense something is wrong with my solutions, but I fail to understand what. What should I do in this situation?
After some digging and thanks to H2CO3's comments and Philipp's comments, I finally could understand how this is supposed to work:
Reading RFC 4627, Section 3. Encoding:
Encoding
JSON text SHALL be encoded in Unicode. The default encoding is
UTF-8.
Since the first two characters of a JSON text will always be ASCII
characters [RFC0020], it is possible to determine whether an octet
stream is UTF-8, UTF-16 (BE or LE), or UTF-32 (BE or LE) by looking
at the pattern of nulls in the first four octets.
00 00 00 xx  UTF-32BE
00 xx 00 xx  UTF-16BE
xx 00 00 00  UTF-32LE
xx 00 xx 00  UTF-16LE
xx xx xx xx  UTF-8
So it appears a JSON octet stream can be encoded in UTF-8, UTF-16, or UTF-32 (in both their BE or LE variants, for the last two).
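The table quoted above can be turned into a small check on the first four octets. This is only a sketch: the enum JsonEncoding and the function detect_json_encoding are names made up for illustration, and treating inputs shorter than four octets as UTF-8 is an assumption of this sketch, not something the RFC table covers.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical names, introduced only for this illustration.
enum class JsonEncoding { Utf8, Utf16BE, Utf16LE, Utf32BE, Utf32LE };

// Guess the encoding of a JSON octet stream from the pattern of nulls in
// its first four octets (table in RFC 4627, Section 3).
// Assumption: inputs shorter than four octets are reported as UTF-8.
JsonEncoding detect_json_encoding(const unsigned char* data, std::size_t size) {
    assert(data != nullptr || size == 0);
    if (size < 4) return JsonEncoding::Utf8;

    const bool z0 = data[0] == 0;
    const bool z1 = data[1] == 0;
    const bool z2 = data[2] == 0;
    const bool z3 = data[3] == 0;

    if (z0 && z1 && z2 && !z3)  return JsonEncoding::Utf32BE;  // 00 00 00 xx
    if (z0 && !z1 && z2 && !z3) return JsonEncoding::Utf16BE;  // 00 xx 00 xx
    if (!z0 && z1 && z2 && z3)  return JsonEncoding::Utf32LE;  // xx 00 00 00
    if (!z0 && z1 && !z2 && z3) return JsonEncoding::Utf16LE;  // xx 00 xx 00
    return JsonEncoding::Utf8;                                 // xx xx xx xx
}
```

A parser that only wants to deal with UTF-8 internally could run this check once and convert the whole input before tokenizing.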
Once that is clear, Section 2.5. Strings explains how to handle those \uXXXX values in JSON strings:
Any character may be escaped. If the character is in the Basic
Multilingual Plane (U+0000 through U+FFFF), then it may be
represented as a six-character sequence: a reverse solidus, followed
by the lowercase letter u, followed by four hexadecimal digits that
encode the character's code point. The hexadecimal letters A though
F can be upper or lowercase. So, for example, a string containing
only a single reverse solidus character may be represented as
"\u005C".
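Reading the four hexadecimal digits described above could look like the following sketch. The function name parse_hex4 and its signature are hypothetical; it expects pos to point just past the "\u" and accepts both upper- and lowercase hex letters, as the RFC allows.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: parse the four hex digits that follow "\u",
// starting at s[pos]. On success, stores the 16-bit value in `out` and
// returns true. Returns false if fewer than four characters remain or
// if one of them is not a hexadecimal digit.
bool parse_hex4(const std::string& s, std::size_t pos, std::uint16_t& out) {
    if (pos > s.size() || s.size() - pos < 4) return false;

    unsigned value = 0;
    for (std::size_t i = 0; i < 4; ++i) {
        const char c = s[pos + i];
        unsigned digit = 0;
        if (c >= '0' && c <= '9')      digit = static_cast<unsigned>(c - '0');
        else if (c >= 'a' && c <= 'f') digit = static_cast<unsigned>(c - 'a') + 10;
        else if (c >= 'A' && c <= 'F') digit = static_cast<unsigned>(c - 'A') + 10;
        else return false;
        value = (value << 4) | digit;  // shift in one hex digit at a time
    }
    out = static_cast<std::uint16_t>(value);
    return true;
}
```

For the escape \u05d9 from the question, calling parse_hex4 on the digits "05d9" stores 0x05D9 in out.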
The section then gives a more complete explanation for characters outside the Basic Multilingual Plane:
To escape an extended character that is not in the Basic Multilingual Plane, the character is represented as a twelve-character sequence,
encoding the UTF-16 surrogate pair. So, for example, a string
containing only the G clef character (U+1D11E) may be represented as
"\uD834\uDD1E".
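The surrogate pair from the quote can be folded back into a single code point with the standard UTF-16 arithmetic. A minimal sketch follows; the function names are hypothetical, and the caller is assumed to have already parsed both \uXXXX values.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helpers: classify a 16-bit value parsed from a \uXXXX escape.
bool is_high_surrogate(std::uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
bool is_low_surrogate(std::uint16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

// Combine a UTF-16 surrogate pair into the code point it encodes.
// Precondition: `high` is a high surrogate and `low` is a low surrogate.
std::uint32_t combine_surrogates(std::uint16_t high, std::uint16_t low) {
    assert(is_high_surrogate(high) && is_low_surrogate(low));
    return 0x10000u
         + ((static_cast<std::uint32_t>(high) - 0xD800u) << 10)
         + (static_cast<std::uint32_t>(low) - 0xDC00u);
}
```

With the G clef example, combine_surrogates(0xD834, 0xDD1E) yields 0x1D11E.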
Hope this helps.
If I were you, I would use std::string to store UTF-8 and UTF-8 only. If the incoming JSON text is already UTF-8 and does not contain any \uXXXX sequences, its string contents can be copied into a std::string as is, byte for byte, without any conversion.
When you parse \uXXXX, you can simply decode it and convert it to UTF-8, effectively treating it as if the real UTF-8 character had been in its place. This is what most JSON parsers do anyway (libjson for sure).
Granted, with this approach, reading JSON with \uXXXX and immediately dumping it back using your library is likely to lose the \uXXXX sequences and replace them with their real UTF-8 representations, but who really cares? Ultimately, the net result is still exactly the same.
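To make the "decode it and convert it to UTF-8" step concrete, here is a minimal sketch of a UTF-8 encoder that appends to a std::string. The name append_utf8 is hypothetical, and rejecting lone surrogates is a choice made for this sketch; a real parser may prefer to substitute a replacement character instead.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: append the UTF-8 encoding of code point `cp` to `out`.
// Returns false, leaving `out` untouched, if `cp` is a surrogate
// (U+D800..U+DFFF) or lies above U+10FFFF.
bool append_utf8(std::string& out, std::uint32_t cp) {
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return false;

    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {                               // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
    return true;
}
```

For the \u05d9 from the question, append_utf8(mystr, 0x05D9) appends the two bytes 0xD7 0x99, which is not the same as pushing the raw values 0x05 and 0xD9.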