
Unicode encoding for string literals in C++11

Following a related question, I'd like to ask about the new character and string literal types in C++11. It seems that we now have four sorts of characters and five sorts of string literals. The character types:

    char     a =  '\x30';         // character, no semantics
    wchar_t  b = L'\xFFEF';       // wide character, no semantics
    char16_t c = u'\u00F6';       // 16-bit, assumed UTF16?
    char32_t d = U'\U0010FFFF';   // 32-bit, assumed UCS-4

And the string literals:

    char     A[] =  "Hello\x0A";         // byte string, "narrow encoding"
    wchar_t  B[] = L"Hell\xF6\x0A";      // wide string, impl-def'd encoding
    char16_t C[] = u"Hell\u00F6";        // (1)
    char32_t D[] = U"Hell\U000000F6\U0010FFFF"; // (2)
    char     E[] = u8"\u00F6\U0010FFFF"; // (3)

The question is this: Are the \x/\u/\U character references freely combinable with all string types? Are all the string types fixed-width, i.e. the arrays contain precisely as many elements as appear in the literal, or do \x/\u/\U references get expanded into a variable number of bytes? Do u"" and u8"" strings have encoding semantics, e.g. can I say char16_t x[] = u"\U0010FFFF", and the non-BMP codepoint gets encoded into a two-unit UTF16 sequence? And similarly for u8? In (1), can I write lone surrogates with \u? Finally, are any of the string functions encoding aware (i.e. they are character-aware and can detect invalid byte sequences)?

This is a bit of an open-ended question, but I'd like to get as complete a picture as possible of the new UTF-encoding and type facilities of the new C++11.

Kerrek SB asked Jul 22 '11 21:07




1 Answer

Are the \x/\u/\U character references freely combinable with all string types?

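Almost. \x can be used in any literal, and so can \u and \U: universal character names are accepted in every kind of character and string literal. The difference is in what you get. In a narrow or wide literal, \u and \U are converted to the implementation-defined execution character set, so the resulting code units are only predictable in the UTF-encoded literals (u8"", u"", U""). Note also that \x always writes a single code unit with the given value, whereas \u and \U name a code point that the compiler encodes for you.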
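A minimal sketch of how the two kinds of escape behave inside UTF-encoded literals, written as compile-time checks. It assumes only a C++11 (or later) compiler; the casts to unsigned char are there so the checks hold whether u8 literals are made of char (C++11 to C++17) or char8_t (C++20).

```cpp
// \x writes exactly one code unit with the given value.
static_assert(sizeof(u8"\xF6") == 2, "one raw byte plus the terminator");

// \u names the code point U+00F6, which UTF-8 encodes as two bytes.
static_assert(sizeof(u8"\u00F6") == 3, "two bytes plus the terminator");
static_assert(static_cast<unsigned char>(u8"\u00F6"[0]) == 0xC3, "lead byte");
static_assert(static_cast<unsigned char>(u8"\u00F6"[1]) == 0xB6, "continuation byte");

// In UTF-16 the two spellings happen to agree: both yield the unit 0x00F6.
static_assert(u"\xF6"[0] == u"\u00F6"[0], "same single code unit");
```

Note that u8"\xF6" is not valid UTF-8 on its own: the compiler stores the byte as written and does not check it.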

Are all the string types fixed-width, i.e. the arrays contain precisely as many elements as appear in the literal, or do \x/\u/\U references get expanded into a variable number of bytes?

Not in the way you mean. \u and \U are converted based on the string's encoding, whereas \x always produces exactly one code unit. The number of "code units" (a Unicode term: a char16_t is a UTF-16 code unit) that a \u or \U reference expands to depends on the encoding of the containing string. The literal u8"\u1024" would create a string containing 3 chars plus a null terminator, because U+1024 lies in the range U+0800 to U+FFFF, which takes three bytes in UTF-8. The literal u"\u1024" would create a string containing 1 char16_t plus a null terminator.

The number of code units used is based on the Unicode encoding.
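The counts can be checked at compile time. The sketch below uses a hypothetical helper, code_units, which is not a standard library facility; it simply reads the array bound of the literal and subtracts the null terminator.

```cpp
#include <cstddef>

// Hypothetical helper: number of code units in a string literal,
// excluding the null terminator.
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) {
    return N - 1;
}

// U+1024 lies in U+0800..U+FFFF.
static_assert(code_units(u8"\u1024") == 3, "three UTF-8 code units");
static_assert(code_units(u"\u1024") == 1, "one UTF-16 code unit");
static_assert(code_units(U"\u1024") == 1, "one UTF-32 code unit");

// An ASCII character takes one code unit in every encoding.
static_assert(code_units(u8"A") == 1, "one UTF-8 code unit");
```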

Do u"" and u8"" strings have encoding semantics, e.g. can I say char16_t x[] = u"\U0010FFFF", and the non-BMP codepoint gets encoded into a two-unit UTF16 sequence?

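Yes. u"" creates a UTF-16 encoded string and u8"" creates a UTF-8 encoded string, each encoded per the Unicode specification. So u"\U0010FFFF" contains the two-unit surrogate pair 0xDBFF 0xDFFF plus a null terminator, and u8"\U0010FFFF" contains the four bytes F4 8F BF BF plus a null terminator.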
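As a sketch, here is what the compiler produces for the highest code point, U+10FFFF, in each encoding. The variable names are made up for illustration; the checks run at compile time.

```cpp
// UTF-16: a non-BMP code point becomes a surrogate pair.
constexpr char16_t utf16[] = u"\U0010FFFF";
static_assert(sizeof(utf16) / sizeof(utf16[0]) == 3, "two units plus the terminator");
static_assert(utf16[0] == 0xDBFF, "high (lead) surrogate");
static_assert(utf16[1] == 0xDFFF, "low (trail) surrogate");

// UTF-32: always one unit per code point.
constexpr char32_t utf32[] = U"\U0010FFFF";
static_assert(sizeof(utf32) / sizeof(utf32[0]) == 2, "one unit plus the terminator");
static_assert(utf32[0] == 0x10FFFF, "the code point itself");

// UTF-8: four bytes for this code point (F4 8F BF BF).
static_assert(sizeof(u8"\U0010FFFF") == 5, "four bytes plus the terminator");
static_assert(static_cast<unsigned char>(u8"\U0010FFFF"[0]) == 0xF4, "lead byte");
```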

In (1), can I write lone surrogates with \u?

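Absolutely not. The specification expressly forbids using the UTF-16 surrogate code points (0xD800-0xDFFF) as the value of \u or \U. A lone surrogate can still be written as a raw code unit with \x, since \x does not name a code point.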
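A short sketch of the distinction: a surrogate value is rejected as a universal character name, but it can still be stored as a raw code unit through \x. The names lone_surrogate and is_surrogate are hypothetical, chosen only for this example.

```cpp
// Accepted: \x writes the code unit 0xD800 without treating it as a code point.
constexpr char16_t lone_surrogate[] = u"\xD800";

// Rejected by the compiler if uncommented: \uD800 is not a valid
// universal character name.
// char16_t bad[] = u"\uD800";

// Hypothetical helper: true for any UTF-16 surrogate code unit.
constexpr bool is_surrogate(char16_t unit) {
    return unit >= 0xD800 && unit <= 0xDFFF;
}

static_assert(is_surrogate(lone_surrogate[0]), "stored as written");
static_assert(!is_surrogate(u"\u00F6"[0]), "an ordinary BMP character");
```

The resulting array is not well-formed UTF-16, so any code that decodes it has to cope with the unpaired surrogate itself.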

Finally, are any of the string functions encoding aware (i.e. they are character-aware and can detect invalid byte sequences)?

Absolutely not. Well, allow me to rephrase that.

std::basic_string doesn't deal with Unicode encodings. It can certainly store UTF-encoded strings, but it can only think of them as sequences of char, char16_t, or char32_t; it can't think of them as a sequence of Unicode codepoints encoded with a particular mechanism. basic_string::length() will return the number of code units, not code points. And obviously, the C standard library string functions are totally useless for this, since they too work purely on code units.

It should be noted however that "length" for a Unicode string does not mean the number of codepoints. Some code points are combining "characters" (an unfortunate name), which combine with the previous codepoint. So multiple codepoints can map to a single visual character.
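To make the three notions of "length" concrete, here is a minimal sketch. count_code_points is a hypothetical helper, not a standard function, and it assumes its input is already well-formed UTF-8; it does no validation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in a string assumed to hold
// well-formed UTF-8, by skipping continuation bytes (bit pattern 10xxxxxx).
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) ++n;
    }
    return n;
}

void demo_lengths() {
    // 'o' followed by U+0308 COMBINING DIAERESIS, spelled as raw UTF-8 bytes.
    const std::string s = "o\xCC\x88";
    assert(s.length() == 3);            // code units (bytes)
    assert(count_code_points(s) == 2);  // code points
    // A reader sees one visual character; nothing in the standard library
    // reports that number.
}
```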

Iostreams can in fact read/write Unicode-encoded values. To do so, you will have to use a locale to specify the encoding and properly imbue it into the various places. This is easier said than done, and I don't have any code on me to show you how.

Nicol Bolas answered Sep 23 '22 05:09