 

How do I input 4-byte UTF-8 characters?

I am writing a small app that I need to test with UTF-8 characters of various byte lengths.

I can input Unicode characters that are encoded in UTF-8 with 1, 2, and 3 bytes just fine by doing, for example:

string in = "pi = \u3a0";

But how do I get a unicode character that is encoded with 4-bytes? I have tried:

string in = "aegan check mark = \u10102";

Which, as far as I understand, should output 𐄂 (the Aegean check mark, U+10102). But when I print it out I get ᴶ0.

What am I missing?

EDIT:

I got it to work by adding leading zeros:

string in = "\U00010102";

Wish I had thought of that sooner :)
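A quick way to confirm the fix, assuming this is C++ on a compiler whose execution character set is UTF-8 (e.g. GCC on Linux; MSVC may mangle the narrow literal, as the answer below shows), is to dump the bytes of the resulting string:

#include <cstdio>
#include <string>

int main()
{
    // \U takes exactly eight hex digits, hence the leading zeros.
    std::string in = "\U00010102";

    // Expected output on a UTF-8 execution charset: 4 bytes: f0 90 84 82
    std::printf("%lu bytes:", (unsigned long)in.size());
    for (std::string::size_type i = 0; i < in.size(); ++i)
        std::printf(" %02x", (unsigned char)in[i]);
    std::printf("\n");
    return 0;
}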

asked Oct 15 '08 by Cactuar


1 Answer

There's a longer form of escape: \U followed by eight hex digits, rather than \u followed by four. This form is also used in Python and C#, amongst others (shown here in a Python 2 shell):

>>> '\xf0\x90\x84\x82'.decode("UTF-8")
u'\U00010102'

However, if you are using byte strings, why not just escape each byte as above, rather than relying on the compiler to convert the escape to a UTF-8 string? That also seems more portable. If I compile the following program:

#include <iostream>
#include <string>

int main()
{
    std::cout << "narrow: " << std::string("\uFF0E").length() <<
        " utf8: " << std::string("\xEF\xBC\x8E").length() <<
        " wide: " << std::wstring(L"\uFF0E").length() << std::endl;

    std::cout << "narrow: " << std::string("\U00010102").length() <<
        " utf8: " << std::string("\xF0\x90\x84\x82").length() <<
        " wide: " << std::wstring(L"\U00010102").length() << std::endl;
}

On Win32 with my current options, cl gives:

warning C4566: character represented by universal-character-name '\UD800DD02' cannot be represented in the current code page (932)

The compiler tries to convert all Unicode escapes in byte strings to the system code page, which, unlike UTF-8, cannot represent all Unicode characters. Oddly, it has understood that \U00010102 is \uD800\uDD02 in UTF-16 (its internal Unicode representation) and mangled the escape in the error message...
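As an aside, that surrogate pair falls straight out of the UTF-16 encoding rules for code points above U+FFFF; here is a small sketch of the arithmetic (plain C++, nothing cl-specific, the variable names are just for illustration):

#include <cstdio>

int main()
{
    unsigned long cp = 0x10102;              // U+10102, the Aegean check mark

    // UTF-16 encodes code points above U+FFFF as two surrogates:
    unsigned long v    = cp - 0x10000;       // 0x00102, a 20-bit value
    unsigned int  high = 0xD800 + (unsigned int)(v >> 10);    // 0xD800
    unsigned int  low  = 0xDC00 + (unsigned int)(v & 0x3FF);  // 0xDD02

    // Prints: U+10102 -> D800 DD02, matching the '\UD800DD02' in the warning
    std::printf("U+%05lX -> %04X %04X\n", cp, high, low);
    return 0;
}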

When run, the test program above prints:

narrow: 2 utf8: 3 wide: 1
narrow: 2 utf8: 4 wide: 2

Note that the UTF-8 byte strings and the wide strings are correct, but the compiler failed to convert "\U00010102", producing the byte string "??", which is incorrect.
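If you have a newer compiler, one more option (which postdates this answer): C++11 added the u8 prefix, which asks the compiler itself to encode the literal as UTF-8, independent of the system code page. A sketch, assuming a C++11-or-later compiler (in C++20 the literal's element type becomes char8_t, hence the cast):

#include <iostream>
#include <string>

int main()
{
    // The compiler encodes u8"" literals as UTF-8 regardless of the
    // execution character set / system code page.
    std::string s(reinterpret_cast<const char*>(u8"\U00010102"));

    // Expected to print: u8 literal: 4
    std::cout << "u8 literal: " << s.length() << std::endl;
    return 0;
}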

answered Sep 28 '22 by gz.