
str.encode() giving unexpected results

I've been playing around with Python built-ins and have gotten some results that confuse me.

Take a look at this code:

>>> 'ü'.encode()
b'\xc3\xbc'

Why was \xc3\xbc (195 and 188 in decimal) returned? According to the ASCII table, ü is the 129th character. Or if you take a look here, ü is the 252nd Unicode character, which is what you get from

>>> ord('ü')
252

So where is the \xc3\xbc coming from, and why is it split into two bytes? And when you decode with b'\xc3\xbc'.decode(), how does it know that these two bytes represent one character?

Have a nice day asked Apr 25 '21 01:04


People also ask

How does UTF-8 encoding work?

UTF-8 is a byte encoding for Unicode characters. Each Unicode character is identified by a code point, and UTF-8 represents each code point using 1, 2, 3 or 4 bytes.
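As a rough illustration of the 1-to-4-byte range, the sketch below measures the encoded length of a few characters. The sample characters and the helper name `utf8_length` are chosen here for illustration and are not part of the original question.

```python
# Sample characters picked for illustration: one from each UTF-8 length class.
samples = ["A", "ü", "€", "😀"]


def utf8_length(ch: str) -> int:
    """Return how many bytes UTF-8 needs to encode a single character."""
    return len(ch.encode("utf-8"))


lengths = {ch: utf8_length(ch) for ch in samples}

for ch, n in lengths.items():
    # Show the code point next to the number of bytes it needs.
    print(f"U+{ord(ch):04X} {ch!r}: {n} byte(s)")
```

The higher the code point, the more bytes are needed: code points up to U+007F fit in one byte, up to U+07FF in two, up to U+FFFF in three, and the rest in four.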

What are valid UTF-8 characters?

UTF-8 is backward-compatible with ASCII and can represent any standard Unicode character. The first 128 UTF-8 characters precisely match the first 128 ASCII characters (numbered 0-127), meaning that existing ASCII text is already valid UTF-8. All other characters use two to four bytes.
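The backward compatibility can be checked directly. This is a small sketch using sample strings picked for illustration:

```python
text = "Hello, world!"  # pure ASCII sample string

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")

# For pure ASCII text the two encodings produce identical bytes.
print(ascii_bytes == utf8_bytes)         # True
print(all(b < 128 for b in utf8_bytes))  # True

# One non-ASCII character adds an extra byte: 5 characters become 6 bytes.
word = "naïve"
mixed = word.encode("utf-8")
print(len(word), len(mixed))             # 5 6
```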

Why is UTF-8 encoding used?

An HTML page can only be in one encoding. You cannot encode different parts of a document in different encodings. A Unicode-based encoding such as UTF-8 can support many languages and can accommodate pages and forms in any mixture of those languages.

What is the difference between UTF-8 and ASCII?

UTF-8 encodes Unicode characters into a sequence of 8-bit bytes. The standard has a capacity for over a million distinct codepoints and is a superset of all characters in widespread use today. By comparison, ASCII (American Standard Code for Information Interchange) includes 128 character codes.




1 Answer

In the table you're looking at, you're reading the section titled "Extended ASCII", more commonly known as ISO/IEC 8859-1, or latin1. ASCII, as a character set, defines 7-bit characters from 0 to 127. latin1 defines the other 128 single-byte values and is an extension of ASCII. Python's str.encode() defaults to UTF-8, which extends ASCII (and hence is compatible with it) but is incompatible with latin1.

The character ü has Unicode code point 0xFC (252 in decimal) and, when using UTF-8, is encoded using two bytes.
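To see where \xc3\xbc comes from and how a decoder can tell the two bytes belong together, here is a hand-rolled sketch of the two-byte UTF-8 layout (110xxxxx 10xxxxxx). The function names `encode_two_byte` and `sequence_length` are made up for this illustration; Python's real codec is implemented differently and also handles validation that this sketch skips.

```python
def encode_two_byte(code_point: int) -> bytes:
    """Encode a code point in the range U+0080..U+07FF as two UTF-8 bytes."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("code point does not use the two-byte form")
    lead = 0b11000000 | (code_point >> 6)          # 110 + top 5 bits
    cont = 0b10000000 | (code_point & 0b00111111)  # 10 + low 6 bits
    return bytes([lead, cont])


def sequence_length(lead_byte: int) -> int:
    """Read the length of a UTF-8 sequence from the high bits of its first byte."""
    if lead_byte >> 7 == 0b0:
        return 1  # 0xxxxxxx: plain ASCII
    if lead_byte >> 5 == 0b110:
        return 2  # 110xxxxx: one continuation byte follows
    if lead_byte >> 4 == 0b1110:
        return 3  # 1110xxxx: two continuation bytes follow
    if lead_byte >> 3 == 0b11110:
        return 4  # 11110xxx: three continuation bytes follow
    raise ValueError("continuation byte or invalid lead byte")


encoded = encode_two_byte(ord("ü"))   # 252 == 0b00011_111100
print(encoded)                        # b'\xc3\xbc'
print(encoded == "ü".encode())        # True
print(sequence_length(encoded[0]))    # 2
```

The first byte, 0xC3, starts with the bits 110, which tells the decoder that exactly one continuation byte (starting with 10) follows. That is how b'\xc3\xbc'.decode() knows the two bytes form a single character.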

Lots of online ASCII tables blur the distinction between ASCII and its 8-bit extensions. It's inaccurate to call the code points 128 up to 255 ASCII characters, because ASCII doesn't assign any character to those code points.
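The differences between these encodings can be checked from Python itself. The sketch below encodes ü with several codecs; `cp437` (the old IBM PC code page) is included as an assumption about where the 129 in the question may come from, since the table itself isn't shown.

```python
ch = "ü"

utf8_bytes = ch.encode()             # str.encode() defaults to UTF-8
latin1_bytes = ch.encode("latin-1")  # one byte, equal to the code point
cp437_bytes = ch.encode("cp437")     # legacy IBM PC code page (assumed source of 129)

print(list(utf8_bytes))    # [195, 188]
print(list(latin1_bytes))  # [252]
print(list(cp437_bytes))   # [129]

# ASCII itself stops at 127, so it cannot encode ü at all.
try:
    ch.encode("ascii")
    ascii_can_encode = True
except UnicodeEncodeError:
    ascii_can_encode = False
print(ascii_can_encode)    # False

# Reading UTF-8 bytes as latin1 yields two characters instead of one.
mojibake = utf8_bytes.decode("latin-1")
print(mojibake)            # Ã¼
```

The same character maps to different byte values depending on the encoding, which is why a table has to say which 8-bit extension it shows.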

Silvio Mayolo answered Sep 27 '22 21:09