
What is the point of the UTF-8 character literals proposed for C++17?

What exactly is the point of these, as proposed by N4267?

Their only function seems to be to prevent extended ASCII characters or partial UTF-8 code points from being specified. They still store in a fixed-width 8-bit char (which, as I understand it, is the correct and best way to handle UTF-8 anyway for almost all use cases), so they don't support non-ASCII characters at all. What is going on?

(Actually I'm not entirely sure I understand the need for UTF-8 string literals either. I guess it's the worry of compilers doing weird/ambiguous things with Unicode strings coupled with validation of the Unicode?)

asked Aug 12 '15 at 15:08 by Muzer



1 Answer

The rationale is covered by Evolution Working Group issue 119, "N4197 Adding u8 character literals, [tiny] Why no u8 character literals?", which tracked the proposal and says:

We have five encoding-prefixes for string-literals (none, L, u8, u, U) but only four for character literals -- the missing one is u8 for character literals.

This matters for implementations where the narrow execution character set is not ASCII. In such a case, u8 character literals would provide an ideal way to write character literals with guaranteed ASCII encoding (the single-code-unit u8 encodings are exactly ASCII), but... we don't provide them. Instead, the best one can do is something like this:

char x_ascii = { u'x' }; 

... where we'll get a narrowing error if the codepoint doesn't fit in a 'char'. (Note that this is not quite the same as u8'x', which would give us an error if the codepoint was not representable as a single code unit in UTF-8.)
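To make the difference concrete, here is a minimal sketch that puts the quoted workaround next to a u8 character literal. It assumes a compiler in C++17 mode or later; the names x_u8 and is_ascii_upper are made up for illustration and do not come from the proposal.

```cpp
// Sketch only: compile as C++17 or later (e.g. -std=c++17).
#include <cassert>

// u8 character literal: well-formed only when the code point fits in a single
// UTF-8 code unit, i.e. when it is ASCII. Its type is char in C++17 and
// char8_t from C++20 on, so auto keeps the sketch valid in both.
constexpr auto x_u8 = u8'x';
static_assert(static_cast<unsigned char>(x_u8) == 0x78,
              "u8'x' is the ASCII value 0x78 whatever the execution character set");

// The workaround quoted above: list-initialization rejects narrowing, so this
// compiles only when the code point fits in a char.
constexpr char x_ascii = { u'x' };
static_assert(x_ascii == 0x78, "u'x' also carries the code point 0x78");

// Hypothetical helper: classify a byte read from an ASCII-based protocol or
// file format, independent of how plain 'A' and 'Z' are encoded locally.
constexpr bool is_ascii_upper(unsigned char c) {
    return c >= static_cast<unsigned char>(u8'A') &&
           c <= static_cast<unsigned char>(u8'Z');
}

// Both lines below are left commented out because they are not portable:
// constexpr auto bad_u8 = u8'\u00e9'; // ill-formed: U+00E9 needs two UTF-8 code units
// char bad_ascii = { u'\u00e9' };     // narrowing error if char is signed 8-bit,
//                                     // but accepted where char is unsigned
```

The last two commented-out lines show the "not quite the same" caveat from the quote: the narrowing check depends on the range of char, while the u8 check depends only on whether the code point is a single UTF-8 code unit. On an implementation whose narrow execution character set is not ASCII (EBCDIC, for example), plain 'x' would have a different value, while u8'x' would still be 0x78.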

answered Oct 19 '22 at 00:10 by Shafik Yaghmour