If I write the character é to a file and open it with a hexadecimal editor, I can see the bytes 0xC3, 0xA9.
According to Wikipedia, the first byte is called the leading byte and the second the trailing byte. As I understand it, 0xC3 is a metadata byte that means the character is encoded with 1 byte, 0xA9, but the Unicode value for é is 0xE9.
I basically want to know why é is encoded with 0xA9 instead of 0xE9. How do text editors convert from 0xC3A9 to 0xE9? Is there a shift operation?
What makes you think that 0xC3 is "a metadata byte"?
Every byte in UTF-8 contains relevant information about the codepoint that is encoded.
The first byte of a UTF-8 encoded codepoint contains a marker (the number of leading 1s) that indicates the total number of bytes used to encode the codepoint (*), followed by the first few bits of the actual codepoint. All trailing bytes then contain a "continuation marker" (the bits 10) and 6 more bits of the encoded codepoint. For é, 0xC3 is 110 00011 and 0xA9 is 10 101001; joining the payload bits 00011 and 101001 gives 000 1110 1001, which is 0xE9.
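To answer the "any shift operation?" part directly, here is a minimal sketch of the two-byte case in Python. The function name decode_two_byte is made up for illustration; real decoders also handle the 1-, 3- and 4-byte forms and reject overlong encodings.

```python
def decode_two_byte(lead: int, trail: int) -> int:
    """Combine a 2-byte UTF-8 sequence into its codepoint (illustrative only)."""
    # The lead byte must look like 110xxxxx and the trail byte like 10xxxxxx.
    if (lead & 0xE0) != 0xC0 or (trail & 0xC0) != 0x80:
        raise ValueError("not a 2-byte UTF-8 sequence")
    # Keep the 5 payload bits of the lead byte, shift them left by 6,
    # then OR in the 6 payload bits of the trail byte.
    return ((lead & 0x1F) << 6) | (trail & 0x3F)


codepoint = decode_two_byte(0xC3, 0xA9)
print(hex(codepoint))  # 0xe9
```

So yes: it is a mask, a shift by 6 and an OR. 0xC3 & 0x1F is 0x03, shifted left by 6 that is 0xC0, and 0xC0 | (0xA9 & 0x3F) is 0xC0 | 0x29 = 0xE9.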
The Wikipedia article on UTF-8 has a pretty good description of the process.
There is an encoding that uses the codepoint value directly: UTF-32 (a.k.a. UCS-4), which is basically "use the codepoint value as a 32-bit value".
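A quick way to see the difference is to encode the same character both ways. This sketch uses Python's built-in codecs; the big-endian variant "utf-32-be" is chosen here only so that no byte order mark is prepended.

```python
text = "\u00e9"  # é as the single codepoint U+00E9

utf8_bytes = text.encode("utf-8")
utf32_bytes = text.encode("utf-32-be")

print(utf8_bytes.hex())   # c3a9
print(utf32_bytes.hex())  # 000000e9
print(hex(ord(text)))     # 0xe9
```

In UTF-32 the codepoint 0xE9 appears as-is, padded to four bytes, while UTF-8 spreads its bits over two bytes.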
(*) The marker is actually remarkably easy: if the byte starts with (i.e. its most significant bits are) 0, then it's a single-byte encoding (i.e. a codepoint between 0 and 127). If it starts with 10, then it's a continuation byte. If it starts with 110, 1110 or 11110, then it's the start of a 2-, 3- or 4-byte sequence, respectively. 111110 and 1111110 used to be defined as well, but are no longer valid in modern UTF-8 (since those are only needed to encode values that are guaranteed to never be used in the Unicode standard).
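The marker rules above can be sketched as a small classifier. The name classify_byte and the returned labels are made up for this example, and it only looks at the leading bit pattern; it does not reject bytes such as 0xC0 or 0xC1, which match 110xxxxx but are not allowed in valid UTF-8.

```python
def classify_byte(byte: int) -> str:
    """Classify a byte by its UTF-8 marker bits (pattern check only)."""
    if (byte & 0x80) == 0x00:   # 0xxxxxxx
        return "single"
    if (byte & 0xC0) == 0x80:   # 10xxxxxx
        return "continuation"
    if (byte & 0xE0) == 0xC0:   # 110xxxxx
        return "lead-2"
    if (byte & 0xF0) == 0xE0:   # 1110xxxx
        return "lead-3"
    if (byte & 0xF8) == 0xF0:   # 11110xxx
        return "lead-4"
    return "invalid"            # 111110xx and above: no longer valid


for b in (0x41, 0xC3, 0xA9):
    print(hex(b), classify_byte(b))
```

For the bytes from the question, 0xC3 is classified as the start of a 2-byte sequence and 0xA9 as a continuation byte.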