
Why does UTF-8 use more than one byte to represent some characters?

I recently read an article on character encoding, and I have a question about one point made there.

In the first figure, the author shows several characters, their code points in various character sets, and how they are encoded in various encoding formats. For example, the code point of é is E9. In ISO-8859-1 it is encoded as the single byte E9. In UTF-16 it is encoded as 00 E9. But in UTF-8 it is encoded using 2 bytes, C3 A9.

My question is: why is this required? The value E9 fits in 1 byte, so why does UTF-8 use two bytes?

Apps asked Aug 21 '11 04:08



1 Answer

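UTF-8 uses the high bits of each byte to indicate how the byte fits into a sequence. A byte whose top bit is 0 (00 to 7F) is a complete character on its own, with 7 bits of data. A byte starting with 110, 1110 or 11110 begins a 2-, 3- or 4-byte sequence, and every following byte starts with 10 and carries only its low 6 bits as character data. Because every byte value from 80 to FF is reserved for these multi-byte sequences, any character over 7F requires (at least) 2 bytes. For é, E9 is 1110 1001 in binary; padded to 11 bits and split as 00011 and 101001, it becomes 110 00011 (C3) followed by 10 101001 (A9).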
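To make the bit layout concrete, here is a minimal Python sketch of a hand-written encoder for a single code point. The function name `utf8_encode` is made up for illustration, and the sketch does not reject surrogate code points; in real code you would just call `str.encode("utf-8")`.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point as UTF-8 (illustrative sketch only)."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("code point out of Unicode range")
    if code_point < 0x80:
        # 0xxxxxxx: one byte, 7 data bits
        return bytes([code_point])
    if code_point < 0x800:
        # 110xxxxx 10xxxxxx: 5 + 6 data bits
        return bytes([
            0xC0 | (code_point >> 6),
            0x80 | (code_point & 0x3F),
        ])
    if code_point < 0x10000:
        # 1110xxxx 10xxxxxx 10xxxxxx: 4 + 6 + 6 data bits
        return bytes([
            0xE0 | (code_point >> 12),
            0x80 | ((code_point >> 6) & 0x3F),
            0x80 | (code_point & 0x3F),
        ])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx: 3 + 6 + 6 + 6 data bits
    return bytes([
        0xF0 | (code_point >> 18),
        0x80 | ((code_point >> 12) & 0x3F),
        0x80 | ((code_point >> 6) & 0x3F),
        0x80 | (code_point & 0x3F),
    ])


# é is code point E9, which is above 7F, so it needs two bytes.
print(utf8_encode(0xE9).hex(" "))  # c3 a9
```

A lone E9 byte cannot stand for é in UTF-8: its bit pattern 1110 1001 starts with 1110, so a decoder reads it as the first byte of a 3-byte sequence.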

Bohemian answered Oct 07 '22 06:10
