I was reading "What is the use of wchar_t in general programming?" and found something confusing in the accepted answer:
It's more common to use char with a variable-width encoding e.g. UTF-8 or GB 18030.
And I found this in my textbook:
Isn't the UTF-8 encoding of Unicode at most 4 bytes? char on most platforms is 1 byte. Am I misunderstanding something?
Update:
After searching and reading, I now know that std::string::size() returns code units. So editors all deal in code units, right? And if I change my file's encoding from UTF-8 to UTF-32, would the size of ə then be 4?
Isn't the UTF-8 encoding of Unicode at most 4 bytes?
As per [lex.ccon]/3, emphasis mine:
A character literal that begins with u8, such as u8'w', is a character literal of type char, known as a UTF-8 character literal. The value of a UTF-8 character literal is equal to its ISO 10646 code point value, provided that the code point value is representable with a single UTF-8 code unit (that is, provided it is in the C0 Controls and Basic Latin Unicode block). If the value is not representable with a single UTF-8 code unit, the program is ill-formed. A UTF-8 character literal containing multiple c-chars is ill-formed.
A single UTF-8 code unit is 1 byte.
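A minimal sketch of that rule, assuming C++17 (where a u8 character literal has type char; since C++20 it is char8_t):

    // Compile as C++17, e.g. g++ -std=c++17 u8lit.cpp
    int main() {
        // OK: 'w' is U+0077, inside the C0 Controls and Basic Latin block,
        // so it fits in a single UTF-8 code unit.
        char ok = u8'w';

        // Ill-formed if uncommented: ə (U+0259) needs two UTF-8 code units
        // (0xC9 0x99), so the compiler must reject the literal.
        // char bad = u8'ə';

        return ok == 'w' ? 0 : 1;
    }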
You are confusing code points with code units.
In UTF-8 each code unit (≈ the data type used by a particular encoding) is one byte (8 bits), so it can be represented in a C++ program by the char type (which the standard guarantees to be at least 8 bits wide).
Now, of course you cannot represent all Unicode code points (≈ character/glyph) in a single code unit that small: there are currently well over 1 million of them, while a byte can hold only 256 distinct values. For this reason, UTF-8 uses multiple code units to represent a single code point (and, to save space and stay compatible with ASCII, a variable-length encoding). So the 😀 code point (U+1F600) is mapped to 4 code units (f0 9f 98 80).
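To make that concrete, here is a sketch (compiled as C++17, where a u8 string literal still has type const char[]; since C++20 it is const char8_t[]) that prints the four code units of 😀:

    #include <cstddef>
    #include <cstdio>

    int main() {
        // One code point (U+1F600), four UTF-8 code units.
        const char emoji[] = u8"😀";
        for (std::size_t i = 0; emoji[i] != '\0'; ++i)
            std::printf("%02x ", static_cast<unsigned char>(emoji[i]));
        std::printf("\n");  // prints: f0 9f 98 80
    }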
Most importantly, C++ is almost everywhere concerned only with code units: strings are treated mostly as opaque binary blobs (with the exception of the 0 byte for C strings). For example, strlen and std::string::size() both report the number of code units, not of code points.
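For example (C++17 again; the ə from your update is U+0259, which UTF-8 encodes as the two bytes 0xC9 0x99):

    #include <cstring>
    #include <iostream>
    #include <string>

    int main() {
        const char* p = u8"ə";  // 1 code point, 2 UTF-8 code units
        std::string s = p;

        std::cout << std::strlen(p) << '\n';  // 2 -- code units, not code points
        std::cout << s.size() << '\n';        // 2 as well
    }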
The u8 prefix cited above is one of the rare exceptions. It tells the compiler that the string enclosed in the literal must be mapped from whatever encoding the compiler is using to read the source file to a UTF-8 string.
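That also answers your update: size() always counts code units, so re-encoding the same text changes the count. A sketch comparing the encodings of ə (the u8 line again assumes C++17):

    #include <iostream>
    #include <string>

    int main() {
        std::string    utf8  = u8"ə";  // 2 code units x 1 byte  = 2 bytes
        std::u16string utf16 = u"ə";   // 1 code unit  x 2 bytes = 2 bytes
        std::u32string utf32 = U"ə";   // 1 code unit  x 4 bytes = 4 bytes

        std::cout << utf8.size()  << '\n';  // 2
        std::cout << utf16.size() << '\n';  // 1
        std::cout << utf32.size() << '\n';  // 1
    }

So the UTF-32 version does occupy 4 bytes, but size() reports 1, because it counts code units rather than bytes.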