
Why can u8'A' be a char when a UTF-8 character can take up to 4 bytes and char is normally 1 byte?

Tags: c++

I was reading "What is the use of wchar_t in general programming?" and found something confusing in the accepted answer:

It's more common to use char with a variable-width encoding e.g. UTF-8 or GB 18030.

And I found this in my textbook:

[image]

Isn't a Unicode character encoded in UTF-8 at most 4 bytes long? On most platforms char is 1 byte. Am I misunderstanding something?


Update:

After searching and reading, now I know that:

  1. Code points and code units are different things. A code point is unique, while its code units depend on the encoding.
  2. u8'a' (a char here, not a string) is only allowed for the basic character set (ASCII, including its control characters). Its value is the code unit value of 'a', and for ASCII characters the code unit has the same value as the code point. (This is what @codekaizer's answer says.)
  3. std::string::size() returns the number of code units.

So editors are all dealing with code units, right? And if I change my file encoding from UTF-8 to UTF-32, would the size of ə then be 4?

Rick, asked May 15 '18 05:05



2 Answers

Isn't a Unicode character encoded in UTF-8 at most 4 bytes long?

As per [lex.ccon]/3, emphasis mine:

A character literal that begins with u8, such as u8'w', is a character literal of type char, known as a UTF-8 character literal. The value of a UTF-8 character literal is equal to its ISO 10646 code point value, provided that the code point value is representable with a single UTF-8 code unit (that is, provided it is in the C0 Controls and Basic Latin Unicode block). If the value is not representable with a single UTF-8 code unit, the program is ill-formed. A UTF-8 character literal containing multiple c-chars is ill-formed.

A single UTF-8 code unit is 1 byte, so any value a u8 character literal is allowed to hold fits in a char.
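
The rule quoted above can be checked at compile time. The following is a minimal sketch, assuming a compiler in C++17 mode or later; `kLetterA` and `code_unit_value` are names made up for this illustration. Note that the quoted wording gives the literal the type char, while C++20 changed the type to char8_t, so the sketch uses `decltype` to stay neutral.

```cpp
// u8'A' holds exactly one UTF-8 code unit; for ASCII that equals the code point.
constexpr auto kLetterA = u8'A';

// The literal is char in C++17 and char8_t from C++20 on; either way it is one byte.
static_assert(sizeof(kLetterA) == 1, "a UTF-8 code unit occupies one byte");
static_assert(kLetterA == 0x41, "U+0041 is encoded as the single code unit 0x41");

// Ill-formed if uncommented: U+0259 needs two UTF-8 code units,
// so it cannot be written as a u8 character literal.
// constexpr auto kSchwa = u8'ə';

// Hypothetical helper: numeric value of a single UTF-8 code unit.
constexpr unsigned code_unit_value(decltype(u8'A') unit) {
    return static_cast<unsigned char>(unit);
}
```

For example, `code_unit_value(u8'a')` evaluates to 0x61, the same number as the code point U+0061.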

Joseph D., answered Oct 23 '22 10:10


You are confusing code points with code units.

In UTF-8 each code unit (≈ the data type used by a particular encoding) is one byte (8 bits), so it can be represented in a C++ program by the char type (which the standard guarantees to be at least 8 bits wide).

Now, of course you cannot represent every Unicode code point (≈ character/glyph) in a single code unit that small: there are well over 1 million code points, while a byte can hold only 256 distinct values. For this reason, UTF-8 uses several code units to represent a single code point (and, to save space and for ASCII compatibility, uses a variable-length encoding of one to four code units). So, the 😀 code point (U+1F600) is mapped to 4 code units (f0 9f 98 80).
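
To make the mapping from one code point to several code units concrete, here is a minimal sketch of a UTF-8 encoder. `encode_utf8` is a hypothetical helper written for this answer, not a standard library function, and it assumes the input is a valid Unicode scalar value.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Encode one Unicode code point as UTF-8, one char per code unit.
// Illustrative helper only; not part of the standard library.
std::string encode_utf8(std::uint32_t cp) {
    // Assumption: surrogates and values above U+10FFFF are never passed in.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::string out;
    if (cp < 0x80) {                      // 1 code unit: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {              // 2 code units: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {            // 3 code units: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                              // 4 code units: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Calling `encode_utf8(0x1F600)` yields the four bytes f0 9f 98 80, while `encode_utf8(0x41)` yields the single byte 41.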

Most importantly, C++ almost everywhere is concerned just with code units: strings are treated mostly as opaque binary blobs (with the exception of the 0 byte that terminates C strings). For example, strlen and std::string::size() both report the number of code units, not the number of code points.
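
If you do need the number of code points, you have to count them yourself. A minimal sketch, assuming the string already holds valid UTF-8; `count_code_points` is a hypothetical helper name, not something the standard library provides.

```cpp
#include <cstddef>
#include <string>

// Count code points in a UTF-8 string by skipping continuation bytes,
// which always have the bit pattern 10xxxxxx.
// Assumption: the input is valid UTF-8; no validation is performed.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++count;  // this byte starts a new code point
        }
    }
    return count;
}
```

For a string holding "a", "ə" and "😀" in UTF-8, size() reports 7 code units (1 + 2 + 4), while `count_code_points` reports 3.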

The u8 prefix cited above is one of the rare exceptions. It tells the compiler that the text enclosed in the literal must be converted from whatever encoding the compiler uses to read the source file into UTF-8.
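
The encoding prefixes also show how the number of code units for the same code point depends on the encoding, which is what the follow-up question about UTF-32 asks. A minimal sketch; `code_units` is a hypothetical helper, and the characters are written as universal character names so the result does not depend on the source file's encoding.

```cpp
#include <cstddef>

// Number of code units in a string literal, excluding the terminating null.
// Illustrative helper only.
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) {
    return N - 1;
}

// U+0259 (ə): two 1-byte code units in UTF-8, one code unit in UTF-16 and UTF-32.
static_assert(code_units(u8"\u0259") == 2, "UTF-8 needs two code units");
static_assert(code_units(u"\u0259") == 1, "UTF-16 needs one code unit");
static_assert(code_units(U"\u0259") == 1, "UTF-32 needs one code unit");

// U+1F600 (😀): four code units in UTF-8, a surrogate pair in UTF-16, one in UTF-32.
static_assert(code_units(u8"\U0001F600") == 4, "UTF-8 needs four code units");
static_assert(code_units(u"\U0001F600") == 2, "UTF-16 needs a surrogate pair");
static_assert(code_units(U"\U0001F600") == 1, "UTF-32 needs one code unit");
```

So in UTF-32, ə is a single code unit, and that code unit is typically 4 bytes wide (the size of char32_t on common platforms).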

Matteo Italia, answered Oct 23 '22 09:10