I recently went through an article on character encoding, and I have a question about one point made there. In the first figure, the author shows several characters, their code points in various character sets, and how those code points are encoded in various encoding formats.
For example, the code point of é is E9. In ISO-8859-1 it is encoded as the single byte E9. In UTF-16 it is represented as 00 E9. But in UTF-8 it is represented using two bytes, C3 A9.
My question is: why is this necessary? The value fits in a single byte, so why does UTF-8 use two?
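(For reference, the byte sequences above can be reproduced in Python 3; this is just a quick check, assuming only the standard library:)

```python
# Observe the encodings of é (code point U+00E9) in each scheme.
ch = "\u00e9"  # é

print(ch.encode("latin-1").hex(" "))    # e9     (ISO-8859-1: one byte)
print(ch.encode("utf-16-be").hex(" "))  # 00 e9  (UTF-16, big-endian: two bytes)
print(ch.encode("utf-8").hex(" "))      # c3 a9  (UTF-8: two bytes)
```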
UTF-8 reserves the high bits of each byte to signal how many bytes belong to a character. A byte in the range 00-7F stands alone (plain ASCII), so its high bit is 0. Anything above 7F must be spread over a multi-byte sequence: the lead byte starts with the bits 110 (for a two-byte sequence) and every continuation byte starts with 10, leaving only the low 6 bits of a continuation byte for actual character data. That means that any character over 7F requires (at least) 2 bytes.
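Here is a minimal sketch of that bit layout in Python (the helper name utf8_two_byte is my own, not anything standard):

```python
def utf8_two_byte(code_point: int) -> bytes:
    """Encode a code point in the range 0x80..0x7FF as two UTF-8 bytes.

    Layout: 110xxxxx 10xxxxxx. The lead byte carries the top 5 bits,
    the continuation byte carries the low 6 bits.
    """
    assert 0x80 <= code_point <= 0x7FF
    lead = 0b11000000 | (code_point >> 6)            # 110xxxxx
    continuation = 0b10000000 | (code_point & 0x3F)  # 10xxxxxx
    return bytes([lead, continuation])

# é is U+00E9 = 0b000_11101001:
#   top 5 bits -> 00011  -> lead byte      110_00011 = 0xC3
#   low 6 bits -> 101001 -> continuation    10_101001 = 0xA9
print(utf8_two_byte(0xE9).hex(" "))  # c3 a9
print("é".encode("utf-8").hex(" "))  # c3 a9 (matches the built-in encoder)
```

This also shows why E9 on its own would be ambiguous in UTF-8: a lone byte with the high bit set would look like a stray continuation byte, so the value has to be repacked into the two-byte pattern.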