Must char16_t strings use UTF-16 encoding?

I've been digging through the specification for a while now and cannot find any conclusive clause that supports either a yes or a no.

Does the following statement:

const char16_t *s = u"asdf";

imply/enforce that the string literal "asdf" must be encoded in UTF-16?

From all I can deduce, it's a yes.

However, proposal N2018 says that char16_t literals are UTF-16 encoded only when "__STDC_UTF_16__" is defined. That leaves the door open: when "__STDC_UTF_16__" is undefined, char16_t literals could be encoded any way the compiler wants.

After all, the standard only guarantees the size, signedness and underlying representation of char16_t; it says nothing about how a compiler must encode a char16_t character literal or string literal.

The spec says:

The size of a char16_t string literal is the total number of escape sequences, universal-character-names, and other characters, plus one for each character requiring a surrogate pair, plus one for the terminating u’\0’. [Note: The size of a char16_t string literal is the number of code units, not the number of characters. —end note ]

This seems to implicitly assume that char16_t string literals are UTF-16 encoded, because "surrogate pair" is a UTF-16 concept.
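The size rule quoted above can be checked at compile time. This is only a sketch: code_units is a hypothetical helper (not something from the standard library), and the expected counts assume the compiler follows the quoted wording, as current GCC, Clang and MSVC are understood to do.

```cpp
#include <cstddef>

// Hypothetical helper: number of char16_t code units in a u"" literal,
// including the terminating u'\0'.
template <std::size_t N>
constexpr std::size_t code_units(const char16_t (&)[N]) { return N; }

// Four BMP characters, one code unit each, plus the terminator.
static_assert(code_units(u"asdf") == 5, "4 code units + terminator");

// U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the BMP, so it counts
// "plus one" for the surrogate pair: 2 code units + terminator.
static_assert(code_units(u"\U0001D11E") == 3, "surrogate pair + terminator");

// Mixed: 'a' (1) + U+1D11E (2) + 'b' (1) + terminator (1).
static_assert(code_units(u"a\U0001D11Eb") == 5, "1 + 2 + 1 + 1");
```

If any of these static_asserts fired, the implementation would not be counting code units the way the quoted paragraph describes.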

Let me know if there's anything vague in the question.

igbgotiz asked Apr 01 '14 04:04


2 Answers

The __STDC_UTF_16__ bits did not make it into the standard text. They are probably in the proposal because it was adapted from a similar proposal for the C language. The C++ standard simply removed any and all of this nonsense and made it UTF-16 or GTFO.

R. Martinho Fernandes answered Sep 24 '22 00:09


The standard is technically unconcerned with the underlying encoding, and specifies only that the value of a single char16_t character literal must correspond to a UCS code point in the range 0 to 0xFFFF:

§ 2.14.3

2 A character literal that begins with the letter u, such as u’y’, is a character literal of type char16_t. The value of a char16_t literal containing a single c-char is equal to its ISO 10646 code point value, provided that the code point is representable with a single 16-bit code unit.
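As a small illustration of that paragraph, the following sketch checks a few char16_t character literals against their code points at compile time. value_of is a hypothetical helper introduced only to make the comparison explicit.

```cpp
#include <type_traits>

// Hypothetical helper: widens a char16_t code unit to a plain integer.
constexpr unsigned long value_of(char16_t c) {
    return static_cast<unsigned long>(c);
}

// A literal prefixed with u has type char16_t.
static_assert(std::is_same<decltype(u'y'), char16_t>::value,
              "u'y' has type char16_t");

// The value of each literal equals its ISO 10646 code point.
static_assert(value_of(u'y') == 0x0079, "U+0079 LATIN SMALL LETTER Y");
static_assert(value_of(u'\u00E9') == 0x00E9, "U+00E9 E WITH ACUTE");
static_assert(value_of(u'\u20AC') == 0x20AC, "U+20AC EURO SIGN");

// By contrast, u'\U0001D11E' is ill-formed: that code point does not fit
// in a single 16-bit code unit, so it cannot be a char16_t character literal.
```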

Strings, on the other hand, can include surrogate pairs:

§ 2.14.5

10 A string literal that begins with u, such as u"asdf", is a char16_t string literal. A char16_t string literal has type “array of n const char16_t”, where n is the size of the string as defined below; it has static storage duration and is initialized with the given characters. A single c-char may produce more than one char16_t character in the form of surrogate pairs.

Only UTF-16 meets both of these requirements, although the standard leaves the door open for future compatible encodings, however unlikely that may be.
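To make the surrogate-pair point concrete, here is a sketch comparing the code units a compiler stores for a non-BMP character with the pair the UTF-16 algorithm would produce. high_surrogate, low_surrogate and clef are hypothetical names chosen for this example, and the asserts assume an implementation that encodes char16_t string literals as UTF-16.

```cpp
// Hypothetical helpers: the standard UTF-16 surrogate computation for a
// code point in the range 0x10000 to 0x10FFFF.
constexpr char16_t high_surrogate(char32_t cp) {
    return static_cast<char16_t>(0xD800 + ((cp - 0x10000) >> 10));
}
constexpr char16_t low_surrogate(char32_t cp) {
    return static_cast<char16_t>(0xDC00 + ((cp - 0x10000) & 0x3FF));
}

// A single c-char (U+1D11E) that produces two char16_t code units.
constexpr char16_t clef[] = u"\U0001D11E";

static_assert(sizeof(clef) / sizeof(clef[0]) == 3, "pair + terminator");
static_assert(clef[0] == high_surrogate(0x1D11E), "matches UTF-16 high surrogate");
static_assert(clef[1] == low_surrogate(0x1D11E), "matches UTF-16 low surrogate");
static_assert(clef[0] == 0xD834 && clef[1] == 0xDD1E, "U+1D11E is D834 DD1E");
static_assert(clef[2] == u'\0', "terminator");
```

An implementation that used some other 16-bit encoding would have to fail at least one of these checks, which is the practical content of "only UTF-16 meets both requirements".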

user657267 answered Sep 21 '22 00:09