I was reading "What is the use of wchar_t in general programming?" and found something confusing in the accepted answer:
It's more common to use char with a variable-width encoding e.g. UTF-8 or GB 18030.
And I found this in my textbook:
Isn't the UTF-8 encoding of Unicode at most 4 bytes? char on most platforms is 1 byte. Am I misunderstanding something?
Update:
After searching and reading, I now know that std::string::size() returns code units. So editors all deal in code units, right? And if I change my file's encoding from UTF-8 to UTF-32, would the size of ə then be 4?
Isn't the UTF-8 encoding of Unicode at most 4 bytes?
As per [lex.ccon]/3, emphasis mine:
A character literal that begins with u8, such as u8'w', is a character literal of type char, known as a UTF-8 character literal. The value of a UTF-8 character literal is equal to its ISO 10646 code point value, provided that the code point value is representable with a single UTF-8 code unit (that is, provided it is in the C0 Controls and Basic Latin Unicode block). If the value is not representable with a single UTF-8 code unit, the program is ill-formed. A UTF-8 character literal containing multiple c-chars is ill-formed.
A single UTF-8 code unit is 1 byte.
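A minimal sketch of that rule, assuming C++17 (where a u8 character literal has type char; since C++20 it is char8_t):

    // Compile as C++17, e.g. g++ -std=c++17 u8lit.cpp
    int main() {
        // OK: 'w' is U+0077, inside the C0 Controls and Basic Latin block,
        // so it fits in a single UTF-8 code unit.
        char ok = u8'w';

        // Ill-formed if uncommented: ə (U+0259) needs two UTF-8 code units
        // (0xC9 0x99), so the compiler must reject the literal.
        // char bad = u8'ə';

        return ok == 'w' ? 0 : 1;
    }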
You are confusing code points with code units.
In UTF-8 each code unit (≈ the data type used by a particular encoding) is one byte (8 bits), so it can be represented in a C++ program by the char type (which the standard guarantees to be at least 8 bits wide).
Now, of course you cannot represent all Unicode code points (≈ character/glyph) in a single code unit that small: there are currently well over 1 million of them, while a byte can hold only 256 distinct values. For this reason, UTF-8 uses multiple code units to represent a single code point (and, to save space and stay compatible with ASCII, a variable-length encoding). So the 😀 code point (U+1F600) is mapped to 4 code units (f0 9f 98 80).
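To make that concrete, here is a sketch (compiled as C++17, where a u8 string literal still has type const char[]; since C++20 it is const char8_t[]) that prints the four code units of 😀:

    #include <cstddef>
    #include <cstdio>

    int main() {
        // One code point (U+1F600), four UTF-8 code units.
        const char emoji[] = u8"😀";
        for (std::size_t i = 0; emoji[i] != '\0'; ++i)
            std::printf("%02x ", static_cast<unsigned char>(emoji[i]));
        std::printf("\n");  // prints: f0 9f 98 80
    }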
Most importantly, C++ is almost everywhere concerned only with code units: strings are treated mostly as opaque binary blobs (with the exception of the 0 byte for C strings). For example, strlen and std::string::size() both report the number of code units, not of code points.
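For example (C++17 again; the ə from your update is U+0259, which UTF-8 encodes as the two bytes 0xC9 0x99):

    #include <cstring>
    #include <iostream>
    #include <string>

    int main() {
        const char* p = u8"ə";  // 1 code point, 2 UTF-8 code units
        std::string s = p;

        std::cout << std::strlen(p) << '\n';  // 2 -- code units, not code points
        std::cout << s.size() << '\n';        // 2 as well
    }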
The u8 prefix cited above is one of the rare exceptions. It tells the compiler that the string enclosed in the literal must be mapped from whatever encoding the compiler is using to read the source file to a UTF-8 string.
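That also answers your update: size() always counts code units, so re-encoding the same text changes the count. A sketch comparing the encodings of ə (the u8 line again assumes C++17):

    #include <iostream>
    #include <string>

    int main() {
        std::string    utf8  = u8"ə";  // 2 code units x 1 byte  = 2 bytes
        std::u16string utf16 = u"ə";   // 1 code unit  x 2 bytes = 2 bytes
        std::u32string utf32 = U"ə";   // 1 code unit  x 4 bytes = 4 bytes

        std::cout << utf8.size()  << '\n';  // 2
        std::cout << utf16.size() << '\n';  // 1
        std::cout << utf32.size() << '\n';  // 1
    }

So the UTF-32 version does occupy 4 bytes, but size() reports 1, because it counts code units rather than bytes.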