 

Are the first 128 characters of utf-8 and ascii identical?


[Images: UTF-8 table; ASCII table]

Sebastian Nielsen asked Jan 25 '26 22:01


1 Answer

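Yes. This was an intentional choice in the design of UTF-8 so that existing 7-bit ASCII text would be compatible: code points U+0000 through U+007F are encoded as the same single bytes, 0x00 through 0x7F, so any pure ASCII file is already valid UTF-8.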
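If you want to check this yourself, here is a minimal sketch in Python. It only relies on the built-in `ascii` and `utf-8` codecs; the variable names are just for illustration.

```python
# Build a string containing every 7-bit ASCII character (U+0000 through U+007F).
ascii_chars = "".join(chr(i) for i in range(128))

# Encode the same string with both codecs.
ascii_bytes = ascii_chars.encode("ascii")
utf8_bytes = ascii_chars.encode("utf-8")

# The two byte sequences should match exactly, one byte per character.
identical = ascii_bytes == utf8_bytes
print(identical)        # True
print(len(utf8_bytes))  # 128
```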

The encoding is also designed intentionally so that bytes in the 7-bit ASCII range (0x00 through 0x7F) cannot mean anything except their ASCII equivalent. Other encodings don't guarantee this. For example, in UTF-16 (big-endian), the Euro symbol (€, U+20AC) is encoded as 0x20 0xAC. But 0x20 is SPACE in ASCII. So if an ASCII-only algorithm tries to space-delimit a string like "€ 10" encoded in UTF-16, it'll split in the middle of the Euro symbol and corrupt the data.
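To see the UTF-16 problem concretely, here is a rough sketch in Python. It assumes big-endian UTF-16 without a byte order mark (`utf-16-be`) and uses a plain byte-level `split` as a stand-in for the "ASCII-only algorithm".

```python
text = "€ 10"

# Big-endian UTF-16, no BOM: every character here takes two bytes.
encoded = text.encode("utf-16-be")
print(encoded.hex(" "))  # 20 ac 00 20 00 31 00 30

# A naive byte-level split on the ASCII SPACE byte (0x20).
parts = encoded.split(b"\x20")

# The split lands inside the Euro sign (0x20 0xAC) and inside the real
# space (0x00 0x20), so the pieces are no longer valid UTF-16 text.
print(parts)
```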

This can't happen in UTF-8. € is encoded there as 0xE2 0x82 0xAC. All three bytes have the high bit set, so none of them is a legal 7-bit ASCII value. An ASCII algorithm that naively splits on the ASCII SPACE (0x20) will still work, even though it doesn't know anything about UTF-8 encoding. (The same is true for any other ASCII character: slash, comma, backslash, percent, etc.) UTF-8 is an incredibly clever text encoding.
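The same experiment with UTF-8, again as a small Python sketch with a byte-level `split` playing the role of the ASCII-only code:

```python
text = "€ 10"

encoded = text.encode("utf-8")
print(encoded.hex(" "))  # e2 82 ac 20 31 30

# Every byte of the multi-byte Euro sequence is 0x80 or above,
# so none of them can collide with an ASCII byte such as 0x20.
euro_bytes_all_high = all(b >= 0x80 for b in "€".encode("utf-8"))

# Splitting on the ASCII SPACE byte only cuts at the real space.
parts = encoded.split(b" ")
decoded = [p.decode("utf-8") for p in parts]
print(decoded)  # ['€', '10']
```

Each piece still decodes cleanly as UTF-8, which is exactly what failed in the UTF-16 case.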

Rob Napier answered Jan 27 '26 10:01