Assume a program is running on a system whose character set uses the UTF-16 encoding. According to The C++ Programming Language, 4th edition, page 150:
A char can hold a character of the machine’s character set.
→ I would therefore expect a char variable to be 2 bytes in size.
But according to ISO/IEC 14882:2014, "sizeof(char), sizeof(signed char) and sizeof(unsigned char) are 1", and The C++ Programming Language, 4th edition, page 149 says:
"[...], so by definition the size of a char is 1"
→ So the size is fixed at 1.
Question: Is there a conflict between these statements, or is sizeof(char) = 1 just a value fixed by definition, with the actual width in bits being implementation-defined and dependent on the system?
UTF-16 is an encoding of Unicode in which each character is composed of either one or two 16-bit elements. Unicode was originally designed as a pure 16-bit encoding, aimed at representing all modern scripts.
UTF-8 encodes a character into a binary string of one, two, three, or four bytes. UTF-16 encodes a Unicode character into a string of either two or four bytes. This distinction is evident from their names. In UTF-8, the smallest binary representation of a character is one byte, or eight bits.
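For example, the following minimal sketch (assuming a C++11 or later compiler) counts how many code units each encoding form uses for a few sample code points; sizeof applied to the u8 and u string literals includes the terminating null, which the - 1 compensates for.

#include <iostream>

int main() {
    // UTF-8: one to four 8-bit code units per code point.
    std::cout << "U+0041  UTF-8 bytes:  " << sizeof(u8"A") - 1 << '\n';          // 1
    std::cout << "U+20AC  UTF-8 bytes:  " << sizeof(u8"\u20AC") - 1 << '\n';     // 3
    std::cout << "U+10348 UTF-8 bytes:  " << sizeof(u8"\U00010348") - 1 << '\n'; // 4

    // UTF-16: one or two 16-bit code units per code point.
    std::cout << "U+0041  UTF-16 units: " << sizeof(u"A") / sizeof(char16_t) - 1 << '\n';          // 1
    std::cout << "U+20AC  UTF-16 units: " << sizeof(u"\u20AC") / sizeof(char16_t) - 1 << '\n';     // 1
    std::cout << "U+10348 UTF-16 units: " << sizeof(u"\U00010348") / sizeof(char16_t) - 1 << '\n'; // 2 (surrogate pair)
}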
As a content author or developer, you should nowadays always choose the UTF-8 character encoding for your content or data. This Unicode encoding is a good choice because you can use a single character encoding to handle any character you are likely to need.
UTF-16LE: a character encoding that maps each code point of the Unicode character set to one or two 16-bit code units, with each code unit stored least significant byte first. UTF-16LE stands for Unicode Transformation Format - 16-bit, Little Endian.
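To make the "Little Endian" part concrete, here is a small sketch; to_utf16le is just an illustrative helper name, not a standard function. It writes one 16-bit code unit with its least significant byte first, which is what the LE variant specifies.

#include <cstdio>
#include <vector>

// Illustrative helper: serialize one 16-bit code unit as UTF-16LE,
// i.e. least significant byte first.
std::vector<unsigned char> to_utf16le(char16_t unit) {
    return { static_cast<unsigned char>(unit & 0xFF),          // low byte
             static_cast<unsigned char>((unit >> 8) & 0xFF) }; // high byte
}

int main() {
    for (unsigned char b : to_utf16le(u'A'))      // U+0041 -> 41 00
        std::printf("%02X ", b);
    std::printf("\n");
    for (unsigned char b : to_utf16le(u'\u20AC')) // U+20AC -> AC 20
        std::printf("%02X ", b);
    std::printf("\n");
}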
The C++ standard (and C, for that matter) effectively defines a byte as the size of the char type, not as an eight-bit quantity¹. As per C++11 1.7/1:
The fundamental storage unit in the C++ memory model is the byte. A byte is at least large enough to contain any member of the basic execution character set and the eight-bit code units of the Unicode UTF-8 encoding form and is composed of a contiguous sequence of bits, the number of which is implementation defined.
Hence the expression sizeof(char) is always 1, no matter what.
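That guarantee can be checked at compile time; this minimal sketch compiles on any conforming C++11 implementation, whatever the actual bit width of a byte happens to be.

// sizeof is measured in bytes, and a byte is by definition the size of char,
// so these assertions hold on every conforming implementation.
static_assert(sizeof(char) == 1, "sizeof(char) is 1 by definition");
static_assert(sizeof(signed char) == 1, "sizeof(signed char) is 1 by definition");
static_assert(sizeof(unsigned char) == 1, "sizeof(unsigned char) is 1 by definition");

int main() {}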
If you want to see whether your baseline char variable (the unsigned variant would probably be best) can actually hold a 16-bit value, the item you want to look at is CHAR_BIT from <climits>. It holds the number of bits in a char variable.
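For example, a sketch along these lines reports the width of char and whether it could hold a raw 16-bit code unit; on mainstream platforms CHAR_BIT is 8, so the answer is normally no and char16_t is the type to reach for.

#include <climits>
#include <iostream>

int main() {
    std::cout << "CHAR_BIT     = " << CHAR_BIT << '\n';      // bits in a char (8 on mainstream platforms)
    std::cout << "sizeof(char) = " << sizeof(char) << '\n';  // always 1, regardless of CHAR_BIT

    // A char can hold a raw UTF-16 code unit only if it is at least 16 bits wide.
    if (CHAR_BIT >= 16)
        std::cout << "char is wide enough for a UTF-16 code unit\n";
    else
        std::cout << "char is too narrow; use char16_t for UTF-16 code units\n";
}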
¹ Many standards, especially ones related to communications protocols, use the more exact term octet for an eight-bit value.