What exactly is the point of u8 character literals as proposed by N4267?
Their only function seems to be to prevent extended-ASCII characters or partial UTF-8 code points from being specified. The values are still stored in a fixed-width 8-bit char (which, as I understand it, is the correct and best way to handle UTF-8 anyway for almost all use cases), so they don't support non-ASCII characters at all. What is going on?
(Actually, I'm not entirely sure I understand the need for UTF-8 string literals either. I guess the worry is compilers doing weird/ambiguous things with Unicode strings, coupled with validation of the UTF-8?)
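For concreteness, a minimal sketch of the behavior I mean (C++17 semantics, where a u8 character literal has type char):

static_assert(sizeof(u8'x') == 1, "a u8 character literal is a single 8-bit code unit");
char ok = u8'x';    // fine: 'x' is ASCII, i.e. one UTF-8 code unit
// char no = u8'¢'; // error: U+00A2 requires two UTF-8 code units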
UTF-8 is a “variable-width” encoding standard: it encodes each code point with between one and four bytes. As a space-saving measure, commonly used code points are represented with fewer bytes than infrequently appearing ones.
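A minimal illustration of those widths (assuming the source file is saved as UTF-8, so the literals below contain the intended code points):

// sizeof counts the terminating '\0', hence the "- 1"
static_assert(sizeof(u8"A")  - 1 == 1, "U+0041 encodes as 1 byte");
static_assert(sizeof(u8"é")  - 1 == 2, "U+00E9 encodes as 2 bytes");
static_assert(sizeof(u8"€")  - 1 == 3, "U+20AC encodes as 3 bytes");
static_assert(sizeof(u8"😀") - 1 == 4, "U+1F600 encodes as 4 bytes");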
A character literal contains a sequence of characters or escape sequences enclosed in single quotation marks, for example 'c'. A character literal may be prefixed with the letter L, for example L'c'. A character literal without the L prefix is an ordinary character literal or a narrow character literal.
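For reference, the full set of character-literal prefixes and the types they produce (u8'c' is the one N4267 adds; it yields char8_t only later, in C++20):

char     a = 'c';    // ordinary (narrow) character literal
wchar_t  b = L'c';   // wide character literal
char     c = u8'c';  // UTF-8 character literal (N4267)
char16_t d = u'c';   // UTF-16 character literal
char32_t e = U'c';   // UTF-32 character literal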
The rationale is covered by Evolution Working Group issue 119 (N4197 Adding u8 character literals, [tiny] Why no u8 character literals?), which tracked the proposal and says:
We have five encoding-prefixes for string-literals (none, L, u8, u, U) but only four for character literals -- the missing one is u8 for character literals.
This matters for implementations where the narrow execution character set is not ASCII. In such a case, u8 character literals would provide an ideal way to write character literals with guaranteed ASCII encoding (the single-code-unit u8 encodings are exactly ASCII), but... we don't provide them. Instead, the best one can do is something like this:
char x_ascii = { u'x' };
... where we'll get a narrowing error if the codepoint doesn't fit in a 'char'. (Note that this is not quite the same as u8'x', which would give us an error if the codepoint was not representable as a single code unit in UTF-8.)
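To make the quoted distinction concrete, a brief sketch (the narrowing comment assumes a platform where plain char is signed):

char a = { u'x' };    // OK everywhere: U+0078 fits in char and is one UTF-8 code unit
// char b = { u'é' }; // narrowing error if char is signed (0xE9 > 127); compiles where
//                    // char is unsigned, even though U+00E9 is two UTF-8 code units
// char c = u8'é';    // error on every implementation: not a single UTF-8 code unit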