
How to handle unicode values in JSON strings?

I'm writing a JSON parser in C++ and am facing a problem when parsing JSON strings:

The JSON specification states that JSON strings can contain unicode characters in the form of:

"here comes a unicode character: \u05d9 !"

My JSON parser tries to map JSON strings to std::string, so usually one character of the JSON string becomes one character of the std::string. However, for those unicode characters, I really don't know what to do:

Should I just put the raw bytes values in my std::string like so:

std::string mystr;
mystr.push_back('\x05');
mystr.push_back('\xd9');

Or should I interpret the two characters with a library like iconv and store the UTF-8 encoded result in my string instead?

Should I use a std::wstring to store all the characters? What then on *NIX OSes, where wchar_t is 4 bytes long?

I sense something is wrong in my solutions but I fail to understand what. What should I do in that situation?

asked Oct 28 '12 by ereOn


2 Answers

After some digging, and thanks to H2CO3's and Philipp's comments, I was finally able to understand how this is supposed to work:

Reading RFC 4627, Section 3 (Encoding):

  3. Encoding

    JSON text SHALL be encoded in Unicode. The default encoding is
    UTF-8.

    Since the first two characters of a JSON text will always be ASCII
    characters [RFC0020], it is possible to determine whether an octet
    stream is UTF-8, UTF-16 (BE or LE), or UTF-32 (BE or LE) by looking
    at the pattern of nulls in the first four octets.

       00 00 00 xx  UTF-32BE
       00 xx 00 xx  UTF-16BE
       xx 00 00 00  UTF-32LE
       xx 00 xx 00  UTF-16LE
       xx xx xx xx  UTF-8
    

So it appears a JSON octet stream can be encoded in UTF-8, UTF-16, or UTF-32 (in their BE and LE variants, for the last two).
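The null-pattern table above is easy to turn into code. A minimal sketch (the function name and string return type are mine, not from the RFC):

```cpp
#include <cstddef>
#include <string>

// Classify a JSON octet stream by the null-byte pattern of its first
// four octets, per the RFC 4627 Section 3 table. Streams shorter than
// four octets fall back to the default encoding, UTF-8.
std::string detect_encoding(const unsigned char* p, std::size_t n) {
    if (n < 4)
        return "UTF-8";
    const bool z0 = p[0] == 0, z1 = p[1] == 0, z2 = p[2] == 0, z3 = p[3] == 0;
    if ( z0 &&  z1 &&  z2 && !z3) return "UTF-32BE";  // 00 00 00 xx
    if ( z0 && !z1 &&  z2 && !z3) return "UTF-16BE";  // 00 xx 00 xx
    if (!z0 &&  z1 &&  z2 &&  z3) return "UTF-32LE";  // xx 00 00 00
    if (!z0 &&  z1 && !z2 &&  z3) return "UTF-16LE";  // xx 00 xx 00
    return "UTF-8";                                   // xx xx xx xx
}
```

For example, a UTF-16LE stream starting with `{` yields the octets `7B 00 ...`, which matches the `xx 00 xx 00` row.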

Once that is clear, Section 2.5 (Strings) explains how to handle those \uXXXX escapes in JSON strings:

Any character may be escaped. If the character is in the Basic
Multilingual Plane (U+0000 through U+FFFF), then it may be
represented as a six-character sequence: a reverse solidus, followed
by the lowercase letter u, followed by four hexadecimal digits that
encode the character's code point. The hexadecimal letters A through
F can be upper or lowercase. So, for example, a string containing
only a single reverse solidus character may be represented as
"\u005C".

With more complete explanations for characters not in the Basic Multilingual Plane.

To escape an extended character that is not in the Basic Multilingual Plane, the character is represented as a twelve-character sequence,
encoding the UTF-16 surrogate pair. So, for example, a string
containing only the G clef character (U+1D11E) may be represented as
"\uD834\uDD1E".

Hope this helps.

answered by ereOn


If I were you, I would use std::string to store UTF-8, and UTF-8 only. If the incoming JSON text does not contain any \uXXXX sequences, the std::string can be used as is, byte for byte, without any conversion.

When you parse a \uXXXX escape, you can simply decode it and convert it to UTF-8, effectively treating it as if the real character had appeared in its place - this is what most JSON parsers do anyway (libjson for sure).
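The "convert it to UTF-8" step is just the usual code-point-to-bytes encoding. A minimal sketch (the helper name is mine):

```cpp
#include <cstdint>
#include <string>

// Append a Unicode code point to a std::string as UTF-8 bytes
// (1 to 4 bytes depending on the code point's range).
void append_utf8(std::string& out, std::uint32_t cp) {
    if (cp < 0x80) {                                       // 1 byte: ASCII
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {                               // 2 bytes
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {                             // 3 bytes
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                                               // 4 bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}
```

So the \u05d9 from the question becomes the two bytes 0xD7 0x99, which is exactly what you would push into your std::string.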

Granted, with this approach, reading JSON containing \uXXXX escapes and immediately dumping it back using your library is likely to lose the \uXXXX sequences and replace them with their actual UTF-8 representations - but who really cares? Ultimately, the net result is exactly the same.

answered by mvp