
Confusing sizeof(char) by ISO/IEC in different character set encoding like UTF-16

Assume a program is running on a system whose character set uses the UTF-16 encoding. According to The C++ Programming Language, 4th edition, page 150:

A char can hold a character of the machine’s character set.

→ I would therefore expect a char variable to have a size of 2 bytes.

But according to ISO/IEC 14882:2014:

"sizeof(char), sizeof(signed char) and sizeof(unsigned char) are 1."

or The C++ Programming Language, 4th edition, page 149:

"[...], so by definition the size of a char is 1"

→ So the size is fixed at 1.

Question: Is there a conflict between the statements above, or is sizeof(char) == 1 just a definition, with the actual size being implementation-defined and varying from system to system?

asked Mar 30 '15 by kembedded



1 Answer

The C++ standard (and C, for that matter) effectively defines a byte as the size of a char type, not as an eight-bit quantity¹. As per C++11 1.7/1 (my bold):

The fundamental storage unit in the C++ memory model is the byte. A byte is at least large enough to contain any member of the basic execution character set and the eight-bit code units of the Unicode UTF-8 encoding form and is composed of a contiguous sequence of bits, the number of which is implementation defined.

Hence the expression sizeof(char) is always 1, no matter what.

If you want to see whether your baseline char variable (probably the unsigned variant would be best) can actually hold a 16-bit value, the item you want to look at is CHAR_BIT from <climits>. This holds the number of bits in a char variable.


¹ Many standards, especially ones related to communications protocols, use the more exact term octet for an eight-bit value.

answered Nov 10 '22 by paxdiablo