Using Unicode in a C++ source file

Tags: c++, c, unicode

I'm working with a C++ source file in which I would like to have a quoted string that contains Asian Unicode characters.

I'm working with Qt on Windows, and the Qt Creator development environment has no problem displaying the Unicode. QStrings also have no problem storing Unicode. When I paste in my Unicode, it displays fine, something like:

#define MY_STRING 鸟

However, when I save, my lovely Unicode characters all become ? marks.

I tried opening the source file and resaving it as Unicode encoded. It then displays and saves correctly in Qt Creator. However, on compile, the compiler seems to have no idea what to do with it and throws a ton of misguided errors and warnings, such as "stray \255 in program" and "null character(s) ignored".

What's the correct way to include Unicode in C++ source files?

William Jones asked Jul 24 '10 20:07


3 Answers

Personally, I don't use any non-ASCII characters in source code. If you use arbitrary Unicode characters in your source files, you have to worry about which encoding the compiler assumes the source file is in, which execution character set it will use, and how it converts from the source to the execution character set.

I think it's a much better idea to keep Unicode data in some sort of resource file, which could be compiled to static data at compile time or loaded at runtime for maximum flexibility. That way you control how the encoding occurs and don't have to worry about the compiler's behavior, which may be influenced by the locale settings in effect at compile time.

It does require a bit more infrastructure, but if you have to internationalize, it's well worth spending the time to choose or develop a flexible and robust strategy.

While it's possible to use universal character escapes (L'\uXXXX') or explicitly encoded byte sequences ("\xXX\xYY\xZZ") in source code, this makes Unicode strings virtually unreadable for humans. If you're having translations made, it's easier for most people involved in the process to deal with text in an agreed universal character encoding scheme.
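For comparison, this is roughly what the character from the question (鸟, U+9E1F) looks like in both escaped forms. It is only a sketch: the constant and function names are hypothetical, and the wide-literal check assumes a compiler whose wchar_t holds UTF-16 or UTF-32 code units, as is the case for GCC, Clang and MSVC.

```cpp
#include <cassert>
#include <cstring>

// U+9E1F written two ways. The source file stays pure ASCII.
static const char    kBirdUtf8[] = "\xE9\xB8\x9F";  // explicit UTF-8 bytes
static const wchar_t kBirdWide[] = L"\u9E1F";       // universal character name

// Hypothetical self-check for the two literals above.
void check_bird_literals() {
    assert(std::strlen(kBirdUtf8) == 3);  // one character, three UTF-8 bytes
    assert(kBirdWide[0] == 0x9E1F);       // assumes UTF-16 or UTF-32 wchar_t
    assert(kBirdWide[1] == L'\0');        // a single wide code unit, then the terminator
}
```

Neither literal tells a reader which character is meant without a lookup, which is the readability cost described above.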

CB Bailey answered Oct 22 '22 09:10


Use the L prefix and the \u or \U notation to escape Unicode characters.

Section 6.4.3 of the C99 specification defines the \u escape sequences.

Example:

 #define MY_STRING L"A \u2261 B"
 /* A congruent-to B; U+2261 is 8801 in decimal */
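
One detail that is easy to get wrong with this notation: the digits after \u are hexadecimal, not decimal. The congruent-to sign is U+2261, which is 8801 in decimal, so the escape for it is \u2261. A small sketch of the check, with hypothetical names and assuming a wchar_t that holds UTF-16 or UTF-32 code units:

```cpp
#include <cassert>
#include <cwchar>

// A universal character name takes hexadecimal digits.
// U+2261 (IDENTICAL TO) equals 8801 in decimal.
static const wchar_t kMyString[] = L"A \u2261 B";

// Hypothetical self-check for the literal above.
void check_my_string() {
    assert(std::wcslen(kMyString) == 5);  // 'A', ' ', U+2261, ' ', 'B'
    assert(kMyString[2] == 0x2261);       // the escaped character
    assert(0x2261 == 8801);               // hex versus decimal spelling
}
```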

Heath Hunnicutt answered Oct 22 '22 08:10


Are you using a wchar_t interface? If so, you want L"\u1234" for a wide string containing the Unicode character U+1234 (the digits after \u are hexadecimal). (Looking at the QString header file, I think this is what you need.)

If not, and your interface is UTF-8, then you'll need to encode your character in UTF-8 first and then create a narrow string containing those bytes, e.g. "\xE1\x88\xB4" for U+1234.
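
To work out the byte sequence for a given code point, the standard UTF-8 bit layout can be applied by hand or with a few lines of code. The helper below is a minimal sketch; the name `encode_utf8` is hypothetical and it is not taken from Qt or any other library:

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8.
// Returns an empty string for surrogates and for values above U+10FFFF.
std::string encode_utf8(std::uint32_t cp) {
    std::string out;
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            return out;                    // surrogates are not valid code points
        }
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {           // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For U+1234 this yields the three bytes E1 88 B4, and for the question's 鸟 (U+9E1F) it yields E9 B8 9F, which would be written as "\xE9\xB8\x9F" in a narrow string literal.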

Rup answered Oct 22 '22 10:10