
Why is the character é encoded as 0xC3 0xA9 in UTF-8?

If I write the character é to a file and open it with a hexadecimal editor, I can see the bytes 0xC3, 0xA9.

According to Wikipedia, the first byte is called the leading byte and the second the trailing byte. 0xC3 is a metadata byte that means that the character is encoded with 1 byte, 0xA9, but the Unicode value for é is 0xE9.

I basically want to know why é is encoded with 0xA9 instead of 0xE9. How do text editors convert from 0xC3A9 to 0xE9? Is there a shift operation?

Gabriel Llamas asked Dec 16 '22 00:12



1 Answer

What makes you think that 0xC3 is "a metadata byte"?

Every byte in UTF-8 contains relevant information about the codepoint that is encoded.

The first byte of a UTF-8 encoded codepoint contains a marker (the number of leading 1 bits) that indicates the total number of bytes used to encode the codepoint (*), followed by the first few bits of the actual codepoint. All trailing bytes then contain a "continuation marker" (the bits 10) and 6 more bits of the encoded codepoint.
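To make this concrete for é, here is a minimal Python sketch of the 2-byte case only. The function names decode_two_byte and encode_two_byte are made up for illustration and are not part of any library. 0xC3 is 11000011 (marker 110, payload 00011) and 0xA9 is 10101001 (marker 10, payload 101001); joining the payloads gives 00011101001, which is 0xE9.

```python
def decode_two_byte(lead: int, trail: int) -> int:
    """Decode a 2-byte UTF-8 sequence (110xxxxx 10yyyyyy) into a codepoint."""
    if (lead & 0b1110_0000) != 0b1100_0000:
        raise ValueError("not a 2-byte lead byte")
    if (trail & 0b1100_0000) != 0b1000_0000:
        raise ValueError("not a continuation byte")
    # Strip the markers, then shift the lead payload left by 6 bits
    # to make room for the 6 payload bits of the trailing byte.
    return ((lead & 0b0001_1111) << 6) | (trail & 0b0011_1111)


def encode_two_byte(codepoint: int) -> bytes:
    """Encode a codepoint in the range U+0080..U+07FF as 2 UTF-8 bytes."""
    if not 0x80 <= codepoint <= 0x7FF:
        raise ValueError("codepoint does not use the 2-byte form")
    lead = 0b1100_0000 | (codepoint >> 6)            # 110 + top 5 bits
    trail = 0b1000_0000 | (codepoint & 0b0011_1111)  # 10 + low 6 bits
    return bytes([lead, trail])


print(hex(decode_two_byte(0xC3, 0xA9)))  # 0xe9
print(encode_two_byte(0xE9).hex())       # c3a9
```

So yes, a shift is involved: the payload bits of the first byte are shifted left by 6 and OR-ed with the payload bits of the second byte.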

The Wikipedia article on UTF-8 has a pretty good description of the process.

There is an encoding that uses the codepoint value directly: UTF-32 (a.k.a. UCS-4), which is basically "use the codepoint value as a 32-bit value".
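As a quick check of that claim, the following sketch uses Python's built-in "utf-32-be" codec (big-endian, no byte order mark) and compares it with UTF-8. The variable names are only for illustration.

```python
ch = "\u00e9"  # é, codepoint U+00E9

utf32 = ch.encode("utf-32-be")  # the codepoint stored as one 32-bit value
utf8 = ch.encode("utf-8")       # marker bits interleaved with payload bits

print(utf32.hex())  # 000000e9
print(utf8.hex())   # c3a9

# In UTF-32 the raw bytes read back directly as the codepoint.
codepoint_from_utf32 = int.from_bytes(utf32, "big")
```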

(*) The marker is actually remarkably easy to read: if the byte starts with (i.e. its most significant bits are) 0, then it's a single-byte encoding (i.e. a codepoint between 0 and 127). If it starts with 10, then it's a continuation byte. If it starts with 110, 1110 or 11110, then it's the start of a 2-, 3- or 4-byte sequence, respectively. 111110 and 1111110 used to be defined as well, but are no longer valid in modern UTF-8 (since they are only needed to encode values that are guaranteed never to be used in the Unicode standard).
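The marker rules above can be sketched as a small classifier. This is an illustration only: classify_byte is a made-up name, and the function looks at the marker bits alone, so it does not reject lead byte values that a strict decoder refuses for other reasons (for example 0xC0 and 0xC1, which could only start overlong encodings).

```python
def classify_byte(b: int) -> str:
    """Classify one byte by its UTF-8 marker bits (simplified sketch)."""
    if not 0 <= b <= 0xFF:
        raise ValueError("expected a single byte value")
    if (b & 0b1000_0000) == 0b0000_0000:
        return "single"        # 0xxxxxxx: codepoints 0..127
    if (b & 0b1100_0000) == 0b1000_0000:
        return "continuation"  # 10xxxxxx
    if (b & 0b1110_0000) == 0b1100_0000:
        return "lead-2"        # 110xxxxx: starts a 2-byte sequence
    if (b & 0b1111_0000) == 0b1110_0000:
        return "lead-3"        # 1110xxxx: starts a 3-byte sequence
    if (b & 0b1111_1000) == 0b1111_0000:
        return "lead-4"        # 11110xxx: starts a 4-byte sequence
    return "invalid"           # 111110xx, 1111110x and above: no longer valid


print(classify_byte(0xC3))  # lead-2
print(classify_byte(0xA9))  # continuation
```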

Joachim Sauer answered Jun 17 '23 04:06
