
How can char[] represent a UTF-8 string?

Tags: c, string, utf-8, c11

In C11, a new kind of string literal was added with the prefix u8. It yields an array of char containing the text encoded as UTF-8. How is this even possible? Isn't a normal char signed, meaning it has one bit less of information available because of the sign bit? My reasoning says that a string of UTF-8 text would need to be an array of unsigned char.
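For illustration, here is a minimal sketch (not from the original question; compile as C11) that stores a u8 literal in a plain char array and dumps its bytes:

    #include <stdio.h>

    int main(void)
    {
        /* In C11, u8"..." yields an array of char holding UTF-8 bytes.
           \u00E9 (e with acute accent) encodes as the two bytes C3 A9. */
        char s[] = u8"h\u00E9llo";

        for (size_t i = 0; s[i] != '\0'; i++)
            printf("%02X ", (unsigned char)s[i]); /* cast avoids sign surprises */
        putchar('\n');
        return 0;
    }

On a typical platform this prints "68 C3 A9 6C 6C 6F": the accented character occupies two bytes, yet the array's element type is still plain char.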

asked Jan 11 '12 by dodehoekspiegel


People also ask

What is a UTF-8 encoded string?

UTF-8 is an encoding system for Unicode. It can translate any Unicode character to a matching unique binary string, and can also translate the binary string back to a Unicode character. This is the meaning of “UTF”, or “Unicode Transformation Format.”

Can UTF-8 represent all characters?

Each UTF can represent any Unicode character that you need to represent. UTF-8 is based on 8-bit code units. Each character is encoded as 1 to 4 bytes. The first 128 Unicode code points are encoded as 1 byte in UTF-8.
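To make the 1-to-4-byte scheme concrete, here is a hedged sketch of an encoder (utf8_encode is a hypothetical helper, not a standard function); it assumes the code point is valid, i.e. at most U+10FFFF and not a surrogate:

    #include <stdio.h>

    /* Encode one Unicode code point into 1 to 4 UTF-8 bytes.
       Returns the number of bytes written to out. */
    static int utf8_encode(unsigned long cp, unsigned char out[4])
    {
        if (cp < 0x80) {                  /* 1 byte: ASCII range */
            out[0] = (unsigned char)cp;
            return 1;
        } else if (cp < 0x800) {          /* 2 bytes: 110xxxxx 10xxxxxx */
            out[0] = (unsigned char)(0xC0 | (cp >> 6));
            out[1] = (unsigned char)(0x80 | (cp & 0x3F));
            return 2;
        } else if (cp < 0x10000) {        /* 3 bytes */
            out[0] = (unsigned char)(0xE0 | (cp >> 12));
            out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[2] = (unsigned char)(0x80 | (cp & 0x3F));
            return 3;
        } else {                          /* 4 bytes */
            out[0] = (unsigned char)(0xF0 | (cp >> 18));
            out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[3] = (unsigned char)(0x80 | (cp & 0x3F));
            return 4;
        }
    }

    int main(void)
    {
        unsigned char buf[4];
        int n = utf8_encode(0x20AC, buf); /* U+20AC EURO SIGN -> E2 82 AC */
        for (int i = 0; i < n; i++)
            printf("%02X ", buf[i]);
        putchar('\n');
        return 0;
    }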

Can ASCII be read as UTF-8?

UTF-8 is backward-compatible with ASCII and can represent any standard Unicode character. The first 128 UTF-8 characters precisely match the first 128 ASCII characters (numbered 0-127), meaning that existing ASCII text is already valid UTF-8.

What characters are not allowed in UTF-8?

The bytes 0xC0, 0xC1, 0xF5, 0xF6, 0xF7, 0xF8, 0xF9, 0xFA, 0xFB, 0xFC, 0xFD, 0xFE, and 0xFF can never appear in well-formed UTF-8: 0xC0 and 0xC1 would only produce overlong encodings, and 0xF5 and above would lead sequences outside the Unicode range.
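A small sketch of that rule as code (utf8_byte_possible is a hypothetical helper name):

    #include <stdbool.h>
    #include <stdio.h>

    /* True if byte b can appear somewhere in well-formed UTF-8.
       0xC0 and 0xC1 would only start overlong encodings, and
       0xF5 through 0xFF would lead sequences outside the Unicode range. */
    static bool utf8_byte_possible(unsigned char b)
    {
        return !(b == 0xC0 || b == 0xC1 || b >= 0xF5);
    }

    int main(void)
    {
        printf("0xC0: %d\n", utf8_byte_possible(0xC0)); /* prints 0 */
        printf("0xF4: %d\n", utf8_byte_possible(0xF4)); /* prints 1 */
        return 0;
    }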


2 Answers

There is a potential problem here:

If an implementation with CHAR_BIT == 8 uses sign-magnitude representation for char (so char is signed), then when UTF-8 requires the bit-pattern 10000000, that's a negative 0. So if the implementation further does not support negative 0, then a given UTF-8 string might contain an invalid (trap) value of char, which is problematic. Even if it does support negative zero, the fact that bit pattern 10000000 compares equal as a char to bit pattern 00000000 (the nul terminator) is liable to cause problems when using UTF-8 data in a char[].

I think this means that for sign-magnitude C11 implementations, char needs to be unsigned. Normally it's up to the implementation whether char is signed or unsigned, but of course if char being signed results in failing to implement UTF-8 literals correctly then the implementer just has to pick unsigned. As an aside, this has been the case for non-2's complement implementations of C++ all along, since C++ allows char as well as unsigned char to be used to access object representations. C only allows unsigned char.

In 2's complement and 1s' complement, the bit patterns required for UTF-8 data are valid values of signed char, so the implementation is free to make char either signed or unsigned and still be able to represent UTF-8 strings in char[]. That's because all 256 bit patterns are valid 2's complement values, and UTF-8 happens not to use the byte 11111111 (1s' complement negative zero).
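To make the values concrete on today's hardware, here is a sketch assuming an 8-bit, 2's complement char; the sign-magnitude case can only live in the comments, since such machines are effectively extinct:

    #include <stdio.h>

    int main(void)
    {
        /* 10000000 is the pattern of every UTF-8 continuation byte.
           Converting 0x80 to a signed char is implementation-defined;
           on 2's complement platforms it typically yields -128. */
        char c = (char)0x80;

        /* -128 is a valid value distinct from 0, so UTF-8 in char[] works.
           On a sign-magnitude implementation this same pattern would be
           negative zero: possibly a trap value, and one that compares
           equal to the nul terminator if negative zero is supported. */
        printf("(char)0x80 = %d, equal to 0? %s\n",
               (int)c, c == 0 ? "yes" : "no");
        return 0;
    }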

answered by Steve Jessop


"Isn't a normal char signed?"

It's implementation-dependent whether char is signed or unsigned.

Further, the sign bit isn't "lost"; it can still be used to represent information. And char is not necessarily 8 bits wide: CHAR_BIT may be larger than 8 on some platforms.
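Both properties are easy to probe on a given implementation; a minimal sketch using <limits.h>:

    #include <stdio.h>
    #include <limits.h>

    int main(void)
    {
        printf("CHAR_BIT = %d\n", CHAR_BIT);  /* 8 on most, but not all, platforms */
        printf("char is %s\n",
               CHAR_MIN < 0 ? "signed" : "unsigned"); /* implementation-defined */
        return 0;
    }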

answered by Fred Foo