In C11, a new kind of string literal was added with the prefix u8. Such a literal produces an array of char containing the text encoded as UTF-8. How is this even possible? Isn't a normal char signed, meaning it has one less bit of information to use because of the sign bit? My reasoning suggests that a string of UTF-8 text would need to be an array of unsigned char.
UTF-8 is an encoding system for Unicode. It can translate any Unicode character to a matching unique binary string, and can also translate the binary string back to a Unicode character. This is the meaning of “UTF”, or “Unicode Transformation Format.”
Each UTF can represent any Unicode character. UTF-8 is based on 8-bit code units: each character is encoded as 1 to 4 bytes, and the first 128 Unicode code points are encoded as a single byte.
UTF-8 is backward-compatible with ASCII and can represent any standard Unicode character. The first 128 UTF-8 characters precisely match the first 128 ASCII characters (numbered 0-127), meaning that existing ASCII text is already valid UTF-8.
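For illustration, here is a minimal sketch (assuming a C11 compiler; the character U+00E9, "é", is just an example) that prints the bytes a u8 literal produces:

    #include <stdio.h>

    int main(void) {
        const char *s = u8"\u00E9";               /* "é", encoded as 0xC3 0xA9 in UTF-8 */
        for (const char *p = s; *p != '\0'; ++p)
            printf("%02X ", (unsigned char)*p);   /* cast avoids sign extension */
        printf("\n");                             /* prints: C3 A9 */
        return 0;
    }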
Yes, some byte values can never occur in valid UTF-8: 0xC0, 0xC1, and 0xF5 through 0xFF are invalid UTF-8 code units.
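A small sketch of that check (the function name is mine, not from any standard library):

    #include <stdbool.h>

    /* True if the byte value can never occur in well-formed UTF-8:
       0xC0 and 0xC1 could only start over-long encodings, and
       0xF5-0xFF would encode code points beyond U+10FFFF. */
    static bool is_invalid_utf8_byte(unsigned char b) {
        return b == 0xC0 || b == 0xC1 || b >= 0xF5;
    }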
There is a potential problem here: if an implementation with CHAR_BIT == 8 uses sign-magnitude representation for char (so char is signed), then when UTF-8 requires the bit pattern 10000000, that's a negative zero. So if the implementation further does not support negative zero, then a given UTF-8 string might contain an invalid (trap) value of char, which is problematic. Even if it does support negative zero, the fact that the bit pattern 10000000 compares equal as a char to the bit pattern 00000000 (the nul terminator) is liable to cause problems when using UTF-8 data in a char[].
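Even on the common two's-complement, signed-char case the byte 0x80 is representable but negative, which is why UTF-8 data in a char[] is usually inspected through unsigned char; a small sketch:

    #include <stdio.h>

    int main(void) {
        char c = (char)0x80;                /* a UTF-8 continuation byte */
        printf("%d\n", c < 0);              /* typically prints 1 where char is signed */
        printf("%d\n", (unsigned char)c);   /* 128: the byte value we actually want */
        return 0;
    }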
I think this means that for sign-magnitude C11 implementations, char needs to be unsigned. Normally it's up to the implementation whether char is signed or unsigned, but of course if char being signed results in failing to implement UTF-8 literals correctly, then the implementer just has to pick unsigned. As an aside, this has been the case for non-2's-complement implementations of C++ all along, since C++ allows char as well as unsigned char to be used to access object representations, whereas C only allows unsigned char.
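A sketch of what that means in practice: inspecting an object's representation byte by byte in C goes through unsigned char (the helper below is mine, not a library function):

    #include <stdio.h>

    static void dump_bytes(const void *obj, size_t n) {
        const unsigned char *p = obj;        /* unsigned char: every byte is a valid value */
        for (size_t i = 0; i < n; ++i)
            printf("%02X ", p[i]);
        printf("\n");
    }

    int main(void) {
        int x = 0x12345678;
        dump_bytes(&x, sizeof x);            /* byte order depends on the platform */
        return 0;
    }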
In 2's complement and 1s' complement, the bit patterns required for UTF-8 data are valid values of signed char, so the implementation is free to make char either signed or unsigned and still be able to represent UTF-8 strings in a char[]. That's because all 256 bit patterns are valid 2's complement values, and UTF-8 happens not to use the byte 11111111 (1s' complement negative zero).
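A sketch of the UTF-8 byte patterns themselves (an encoder written purely for illustration; code points above U+10FFFF and the surrogate range are not checked) makes this visible: a lead byte is at most 0xF4 and a continuation byte is 10xxxxxx, so 0xFE and 0xFF never occur in the output:

    #include <stddef.h>

    static size_t utf8_encode(unsigned long cp, unsigned char out[4]) {
        if (cp < 0x80) {                                  /* 0xxxxxxx */
            out[0] = (unsigned char)cp;
            return 1;
        } else if (cp < 0x800) {                          /* 110xxxxx 10xxxxxx */
            out[0] = (unsigned char)(0xC0 | (cp >> 6));
            out[1] = (unsigned char)(0x80 | (cp & 0x3F));
            return 2;
        } else if (cp < 0x10000) {                        /* 1110xxxx 10xxxxxx 10xxxxxx */
            out[0] = (unsigned char)(0xE0 | (cp >> 12));
            out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[2] = (unsigned char)(0x80 | (cp & 0x3F));
            return 3;
        } else {                                          /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
            out[0] = (unsigned char)(0xF0 | (cp >> 18));
            out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[3] = (unsigned char)(0x80 | (cp & 0x3F));
            return 4;
        }
    }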
Isn't a normal char signed?
It's implementation-dependent whether char is signed or unsigned.
Further, the sign bit isn't "lost": it can still be used to represent information, and char is not necessarily 8 bits wide (it might be larger on some platforms).
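Both properties can be checked from <limits.h>; a quick sketch:

    #include <stdio.h>
    #include <limits.h>

    int main(void) {
        printf("CHAR_BIT = %d\n", CHAR_BIT);                          /* at least 8, may be larger */
        printf("char is %s\n", CHAR_MIN < 0 ? "signed" : "unsigned");
        return 0;
    }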