
How can char[] represent a UTF-8 string?

Tags: c, string, utf-8, c11

In C11, a new kind of string literal was added with the prefix u8. It yields an array of char containing the text encoded as UTF-8. How is this even possible? Isn't a normal char signed, meaning it has one bit less of information available because of the sign bit? My reasoning says that a string of UTF-8 text would need to be an array of unsigned char.
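For illustration, here is a minimal sketch (not from the original question; compile as C11) that stores a u8 literal in a plain char array and dumps its bytes:

    #include <stdio.h>

    int main(void)
    {
        /* In C11, u8"..." yields an array of char holding UTF-8 bytes.
           \u00E9 (e with acute accent) encodes as the two bytes C3 A9. */
        char s[] = u8"h\u00E9llo";

        for (size_t i = 0; s[i] != '\0'; i++)
            printf("%02X ", (unsigned char)s[i]); /* cast avoids sign surprises */
        putchar('\n');
        return 0;
    }

On a typical platform this prints "68 C3 A9 6C 6C 6F": the accented character occupies two bytes, yet the array's element type is still plain char.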

asked Jan 11 '12 by dodehoekspiegel


People also ask

What is a UTF-8 encoded string?

UTF-8 is an encoding system for Unicode. It can translate any Unicode character to a matching unique binary string, and can also translate the binary string back to a Unicode character. This is the meaning of “UTF”, or “Unicode Transformation Format.”

Can UTF-8 represent all characters?

Each UTF can represent any Unicode character that you need to represent. UTF-8 is based on 8-bit code units. Each character is encoded as 1 to 4 bytes. The first 128 Unicode code points are encoded as 1 byte in UTF-8.
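To make the 1-to-4-byte scheme concrete, here is a hedged sketch of an encoder (utf8_encode is a hypothetical helper, not a standard function); it assumes the code point is valid, i.e. at most U+10FFFF and not a surrogate:

    #include <stdio.h>

    /* Encode one Unicode code point into 1 to 4 UTF-8 bytes.
       Returns the number of bytes written to out. */
    static int utf8_encode(unsigned long cp, unsigned char out[4])
    {
        if (cp < 0x80) {                  /* 1 byte: ASCII range */
            out[0] = (unsigned char)cp;
            return 1;
        } else if (cp < 0x800) {          /* 2 bytes: 110xxxxx 10xxxxxx */
            out[0] = (unsigned char)(0xC0 | (cp >> 6));
            out[1] = (unsigned char)(0x80 | (cp & 0x3F));
            return 2;
        } else if (cp < 0x10000) {        /* 3 bytes */
            out[0] = (unsigned char)(0xE0 | (cp >> 12));
            out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[2] = (unsigned char)(0x80 | (cp & 0x3F));
            return 3;
        } else {                          /* 4 bytes */
            out[0] = (unsigned char)(0xF0 | (cp >> 18));
            out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[3] = (unsigned char)(0x80 | (cp & 0x3F));
            return 4;
        }
    }

    int main(void)
    {
        unsigned char buf[4];
        int n = utf8_encode(0x20AC, buf); /* U+20AC EURO SIGN -> E2 82 AC */
        for (int i = 0; i < n; i++)
            printf("%02X ", buf[i]);
        putchar('\n');
        return 0;
    }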

Can ASCII be read as UTF-8?

UTF-8 is backward-compatible with ASCII and can represent any standard Unicode character. The first 128 UTF-8 characters precisely match the first 128 ASCII characters (numbered 0-127), meaning that existing ASCII text is already valid UTF-8.

What characters are not allowed in UTF-8?

The bytes 0xC0, 0xC1, 0xF5, 0xF6, 0xF7, 0xF8, 0xF9, 0xFA, 0xFB, 0xFC, 0xFD, 0xFE, and 0xFF can never appear in well-formed UTF-8: 0xC0 and 0xC1 would only produce overlong encodings, and 0xF5 and above would lead sequences outside the Unicode range.
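A small sketch of that rule as code (utf8_byte_possible is a hypothetical helper name):

    #include <stdbool.h>
    #include <stdio.h>

    /* True if byte b can appear somewhere in well-formed UTF-8.
       0xC0 and 0xC1 would only start overlong encodings, and
       0xF5 through 0xFF would lead sequences outside the Unicode range. */
    static bool utf8_byte_possible(unsigned char b)
    {
        return !(b == 0xC0 || b == 0xC1 || b >= 0xF5);
    }

    int main(void)
    {
        printf("0xC0: %d\n", utf8_byte_possible(0xC0)); /* prints 0 */
        printf("0xF4: %d\n", utf8_byte_possible(0xF4)); /* prints 1 */
        return 0;
    }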


2 Answers

There is a potential problem here:

If an implementation with CHAR_BIT == 8 uses sign-magnitude representation for char (so char is signed), then when UTF-8 requires the bit-pattern 10000000, that's a negative 0. So if the implementation further does not support negative 0, then a given UTF-8 string might contain an invalid (trap) value of char, which is problematic. Even if it does support negative zero, the fact that bit pattern 10000000 compares equal as a char to bit pattern 00000000 (the nul terminator) is liable to cause problems when using UTF-8 data in a char[].

I think this means that for sign-magnitude C11 implementations, char needs to be unsigned. Normally it's up to the implementation whether char is signed or unsigned, but of course if char being signed results in failing to implement UTF-8 literals correctly then the implementer just has to pick unsigned. As an aside, this has been the case for non-2's complement implementations of C++ all along, since C++ allows char as well as unsigned char to be used to access object representations. C only allows unsigned char.

In 2's complement and 1s' complement, the bit patterns required for UTF-8 data are valid values of signed char, so the implementation is free to make char either signed or unsigned and still be able to represent UTF-8 strings in char[]. That's because all 256 bit patterns are valid 2's complement values, and UTF-8 happens not to use the byte 11111111 (1s' complement negative zero).
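To make the values concrete on today's hardware, here is a sketch assuming an 8-bit, 2's complement char; the sign-magnitude case can only live in the comments, since such machines are effectively extinct:

    #include <stdio.h>

    int main(void)
    {
        /* 10000000 is the pattern of every UTF-8 continuation byte.
           Converting 0x80 to a signed char is implementation-defined;
           on 2's complement platforms it typically yields -128. */
        char c = (char)0x80;

        /* -128 is a valid value distinct from 0, so UTF-8 in char[] works.
           On a sign-magnitude implementation this same pattern would be
           negative zero: possibly a trap value, and one that compares
           equal to the nul terminator if negative zero is supported. */
        printf("(char)0x80 = %d, equal to 0? %s\n",
               (int)c, c == 0 ? "yes" : "no");
        return 0;
    }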

answered by Steve Jessop


"Isn't a normal char signed?"

It's implementation-dependent whether char is signed or unsigned.

Further, the sign bit isn't "lost"; it can still be used to represent information. And char is not necessarily 8 bits wide: CHAR_BIT may be larger than 8 on some platforms.
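Both properties are easy to probe on a given implementation; a minimal sketch using <limits.h>:

    #include <stdio.h>
    #include <limits.h>

    int main(void)
    {
        printf("CHAR_BIT = %d\n", CHAR_BIT);  /* 8 on most, but not all, platforms */
        printf("char is %s\n",
               CHAR_MIN < 0 ? "signed" : "unsigned"); /* implementation-defined */
        return 0;
    }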

answered by Fred Foo