I've been playing around with python built-ins and have gotten some confusing (for me) results.
Take a look at this code:
>>> 'ü'.encode()
b'\xc3\xbc'
Why was \xc3\xbc (195 and 188 in decimal) returned? If you look at the ASCII table, ü is the 129th character. Or if you take a look here, ü is the 252nd Unicode character, which is what you get from
>>> ord('ü')
252
So where is the \xc3\xbc coming from, and why is it split into two bytes? And when you decode with b'\xc3\xbc'.decode(), how does it know that these two bytes represent one character?
UTF-8 is a byte-oriented encoding for Unicode characters. Each Unicode character is identified by a code point, and UTF-8 represents each code point using 1, 2, 3 or 4 bytes.
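A quick way to see the variable length is to encode one character from each range and count the bytes. This is only a sketch; the four sample characters are my own picks, not anything from the question.

```python
# Sample characters (an arbitrary choice) covering each UTF-8 length:
# A (U+0041), ü (U+00FC), € (U+20AC), 😀 (U+1F600)
samples = ["A", "\u00fc", "\u20ac", "\U0001F600"]

# Number of bytes each character occupies once encoded as UTF-8.
byte_lengths = [len(ch.encode("utf-8")) for ch in samples]

for ch, n in zip(samples, byte_lengths):
    print(f"U+{ord(ch):04X} -> {n} byte(s): {ch.encode('utf-8')!r}")
```

The higher the code point, the more bytes are needed, up to a maximum of four.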
UTF-8 is backward-compatible with ASCII and can represent any standard Unicode character. The first 128 UTF-8 characters precisely match the first 128 ASCII characters (numbered 0-127), meaning that existing ASCII text is already valid UTF-8. All other characters use two to four bytes.
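The ASCII compatibility is easy to check for yourself. A minimal sketch, using made-up sample strings:

```python
text = "plain ASCII text"  # hypothetical sample, ASCII-only

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")

# For pure ASCII text the two encodings produce identical bytes,
# and every byte is below 128.
same_bytes = ascii_bytes == utf8_bytes
all_below_128 = all(b < 128 for b in utf8_bytes)

# One non-ASCII character (ï, U+00EF) adds a two-byte sequence.
mixed = "na\u00efve".encode("utf-8")

print(same_bytes, all_below_128, mixed)
```

Only the non-ASCII character grows; the ASCII letters around it stay as single bytes.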
Why use UTF-8? An HTML page can only be in one encoding. You cannot encode different parts of a document in different encodings. A Unicode-based encoding such as UTF-8 can support many languages and can accommodate pages and forms in any mixture of those languages.
UTF-8 encodes Unicode characters into a sequence of 8-bit bytes. The standard has a capacity for over a million distinct codepoints and is a superset of all characters in widespread use today. By comparison, ASCII (American Standard Code for Information Interchange) includes 128 character codes.
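In the table you're looking at, ü appears in the section titled "Extended ASCII". That label is often applied to ISO/IEC 8859-1 (latin1), though latin1 places ü at 252, the same as its Unicode code point; a table that lists it at 129 is showing a different 8-bit code page, such as IBM code page 437. ASCII, as a character set, defines only 7-bit characters from 0 to 127. latin1 is an extension of ASCII that assigns characters to the upper 128 single-byte values. Python's str.encode() defaults to UTF-8, which also extends ASCII (and hence is compatible with it) but is incompatible with latin1 for anything above 127.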
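To see how the single-byte encodings differ from UTF-8 for this one character, you can encode ü with each codec. This is a rough sketch; including cp437 here is my assumption about which "extended ASCII" table puts ü at 129.

```python
u_umlaut = "\u00fc"  # ü

latin1_bytes = u_umlaut.encode("latin-1")  # single byte 0xFC (252)
cp437_bytes = u_umlaut.encode("cp437")     # single byte 0x81 (129)
utf8_bytes = u_umlaut.encode("utf-8")      # two bytes 0xC3 0xBC

# A lone latin1 byte above 127 is not valid UTF-8.
try:
    latin1_bytes.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False

# Going the other way "works" but yields two wrong characters.
mojibake = utf8_bytes.decode("latin-1")

print(latin1_bytes, cp437_bytes, utf8_bytes, decodes_as_utf8, mojibake)
```

The same character maps to three different byte sequences, so you always need to know which encoding produced the bytes.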
The character ü has Unicode code point 0xFC (252 in decimal) and, when using UTF-8, is encoded using two bytes.
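Where the two bytes come from can be worked out by hand. Code points from 0x80 to 0x7FF use the two-byte pattern 110xxxxx 10xxxxxx, with the code point's bits filling the x positions. The sketch below does that arithmetic; the helper names are made up for illustration and only handle the two-byte case.

```python
def encode_two_byte(code_point):
    """Hand-rolled UTF-8 encoding for code points 0x80..0x7FF only."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("not a two-byte code point")
    lead = 0b11000000 | (code_point >> 6)           # 110 + top 5 bits
    continuation = 0b10000000 | (code_point & 0x3F)  # 10 + low 6 bits
    return bytes([lead, continuation])


def decode_two_byte(data):
    """Reverse of encode_two_byte: check the prefixes, then merge bits."""
    lead, continuation = data
    if lead >> 5 != 0b110 or continuation >> 6 != 0b10:
        raise ValueError("not a two-byte UTF-8 sequence")
    return ((lead & 0x1F) << 6) | (continuation & 0x3F)


encoded = encode_two_byte(ord("\u00fc"))  # ü is U+00FC (252)
decoded = decode_two_byte(encoded)

print(encoded, decoded)
```

This also answers how decode() knows the bytes belong together: a lead byte starting with the bits 110 announces a two-byte sequence, and the byte starting with 10 can only be a continuation, never the start of a character.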
Lots of online ASCII tables get this wrong. It's inaccurate to call the code points 128 up to 255 ASCII characters, because ASCII doesn't claim to assign any value to those code points.