I was reading this highly rated post on SO about Unicode.
Here is an illustration given there:
$ python
>>> import sys
>>> print sys.stdout.encoding
UTF-8
>>> print '\xe9' # (1)
é
>>> print u'\xe9' # (2)
é
>>> print u'\xe9'.encode('latin-1') # (3)
é
>>>
and the explanation was given as:
(1) python outputs binary string as is, terminal receives it and tries to match its value with latin-1 character map. In latin-1, 0xe9 or 233 yields the character "é" and so that's what the terminal displays.
My question is: why does the terminal match to the latin-1 character map when the encoding is 'UTF-8'?
Also when I tried
>>> print '\xe9'
?
>>> print u'\xe9'
é
I get a different result for the first one than what is described above. Why is there this discrepancy, and where does latin-1 come into play in this picture?
The latin-1 encoding in Python implements ISO_8859-1:1987 which maps all possible byte values to the first 256 Unicode code points, and thus ensures decoding errors will never occur regardless of the configured error handler.
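A minimal sketch of that one-to-one mapping, written for Python 3 (an assumption on my part, since the session in the question is Python 2; there the same bytes would be a plain str):

```python
# Every possible byte value, 0x00 through 0xFF.
all_bytes = bytes(range(256))

# Latin-1 decodes each byte to the code point with the same number,
# so this can never raise, whatever the input bytes are.
decoded = all_bytes.decode("latin-1")
code_points = [ord(ch) for ch in decoded]

# The mapping is reversible: encoding gives back the original bytes.
round_trip = decoded.encode("latin-1")

print(code_points[0xE9])  # 233, i.e. U+00E9
```

This is why a Latin-1 terminal always shows something for the byte 0xE9, whether or not the program meant it as Latin-1.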
A Unicode string is a sequence of code points, which are numbers from 0 through 0x10FFFF (1,114,111 decimal). This sequence of code points needs to be represented in memory as a set of code units, and code units are then mapped to 8-bit bytes.
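To make the code point / byte distinction concrete, here is a small Python 3 sketch. The sample text is my own choice (the two characters that show up elsewhere in this thread), not something from the quoted passage:

```python
# Two code points: U+00E9 (e with acute accent) and U+9000 (a CJK ideograph).
text = "\u00e9\u9000"
code_points = [ord(ch) for ch in text]

# The same two code points become different byte sequences
# depending on which encoding is used.
utf8_bytes = text.encode("utf-8")       # 2 bytes + 3 bytes
utf16_bytes = text.encode("utf-16-le")  # one 16-bit code unit each
utf32_bytes = text.encode("utf-32-le")  # one 32-bit code unit each

print([hex(cp) for cp in code_points])
print(utf8_bytes, utf16_bytes, utf32_bytes)
```

The string itself never changes; only the bytes chosen to represent it do.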
The bytes 128-255 that Latin-1 uses for its upper range are not valid UTF-8 on their own. Although those characters have the same code points in Unicode, UTF-8 represents each of them as a two-byte sequence.
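A short Python 3 sketch of that difference, using U+00E9 as the example character (my choice, to match the question):

```python
ch = "\u00e9"  # code point 233 in both Latin-1 and Unicode

latin1_bytes = ch.encode("latin-1")  # a single byte, 0xE9
utf8_bytes = ch.encode("utf-8")      # two bytes, 0xC3 0xA9

# The lone Latin-1 byte is not a complete UTF-8 sequence.
try:
    latin1_bytes.decode("utf-8")
    lone_byte_is_valid_utf8 = True
except UnicodeDecodeError:
    lone_byte_is_valid_utf8 = False

# Going the other way produces mojibake rather than an error:
# the two UTF-8 bytes are read as two separate Latin-1 characters.
mojibake = utf8_bytes.decode("latin-1")
```

Here `mojibake` is the two-character string U+00C3 U+00A9, which is the familiar "Ã©" you see when UTF-8 output lands in a Latin-1 viewer.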
Python 2 uses the str type to store bytes and the unicode type to store Unicode code points. All string literals are str by default, which means bytes, and the default encoding is ASCII. So if an incoming file contains Cyrillic characters, Python 2 might fail because ASCII cannot represent those characters.
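The failure mode can be reproduced in Python 3 by decoding with ASCII explicitly, which is roughly what Python 2 does implicitly when it has to turn a str into unicode. This is a sketch under that assumption; the sample word is made up for illustration:

```python
# UTF-8 bytes for the Cyrillic word "Привет", standing in for file contents.
cyrillic_bytes = "\u041f\u0440\u0438\u0432\u0435\u0442".encode("utf-8")

# ASCII only covers byte values 0-127, so the very first byte fails.
try:
    cyrillic_bytes.decode("ascii")
    error = None
except UnicodeDecodeError as exc:
    error = exc

# Naming the real encoding decodes the same bytes without trouble.
decoded = cyrillic_bytes.decode("utf-8")
```

The fix in both Python versions is the same: decode explicitly with the encoding the data was actually written in.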
You are missing some important context; in that case the OP configured the terminal emulator (Gnome Terminal) to interpret output as Latin-1 but left the shell variables set to UTF-8. Python thus is told by the shell to use UTF-8 for Unicode output but the actual configuration of the terminal is to expect Latin-1 bytes.
The print output clearly shows the terminal is interpreting output using Latin-1, and is not using UTF-8.
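The mismatch can be simulated without touching any terminal settings. The sketch below is Python 3, and `terminal_shows` is a hypothetical helper of mine that only approximates a terminal by decoding with the `replace` error handler; real terminal emulators differ in what they draw for invalid input:

```python
def terminal_shows(raw_bytes, terminal_encoding):
    """Hypothetical stand-in for a terminal: decode the bytes it receives
    with its configured encoding, substituting U+FFFD for invalid input."""
    return raw_bytes.decode(terminal_encoding, errors="replace")

raw = b"\xe9"                           # what print '\xe9' sends in Python 2
utf8_encoded = "\u00e9".encode("utf-8")  # what print u'\xe9' sends when
                                         # sys.stdout.encoding is UTF-8

on_latin1_raw = terminal_shows(raw, "latin-1")            # U+00E9
on_utf8_raw = terminal_shows(raw, "utf-8")                # U+FFFD
on_latin1_utf8 = terminal_shows(utf8_encoded, "latin-1")  # two characters
on_utf8_utf8 = terminal_shows(utf8_encoded, "utf-8")      # U+00E9
```

Only the combinations where the bytes sent and the terminal setting agree give the expected single character.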
When a terminal is set to UTF-8, the \xe9 byte is not valid (incomplete) UTF-8 and your terminal usually prints a question mark instead:
>>> import sys
>>> sys.stdout.encoding
'UTF-8'
>>> print '\xe9'
?
>>> print u'\xe9'
é
>>> print u'\xe9'.encode('utf8')
é
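If you decode that byte as UTF-8 yourself, Python raises an error; if you instruct it to replace undecodable bytes rather than raise (the 'replace' error handler), it gives you the U+FFFD REPLACEMENT CHARACTER glyph � instead: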
>>> '\xe9'.decode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/mj/Development/venvs/stackoverflow-2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe9 in position 0: unexpected end of data
>>> '\xe9'.decode('utf8', 'replace')
u'\ufffd'
>>> print '\xe9'.decode('utf8', 'replace')
�
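For comparison, here is the same experiment as a Python 3 sketch, where the input has to be a bytes object. It also shows the `ignore` handler, which is not used above, to make clear that it is `replace` that yields U+FFFD:

```python
bad = b"\xe9"  # a lone lead byte with no continuation bytes

# Default (strict) handling raises.
try:
    bad.decode("utf-8")
    strict_error = None
except UnicodeDecodeError as exc:
    strict_error = exc

# 'replace' substitutes U+FFFD for the undecodable byte.
replaced = bad.decode("utf-8", errors="replace")

# 'ignore' silently drops it, leaving an empty string.
ignored = bad.decode("utf-8", errors="ignore")
```

In all three cases the lone byte 0xE9 is treated as malformed UTF-8; the handlers only differ in how they report it.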
That's because in UTF-8, \xe9 is the start byte of a 3-byte encoding, for the Unicode codepoints U+9000 through to U+9FFF, and if printed as just a single byte is invalid. This works:
>>> print '\xe9\x80\x80'
退
because that's the UTF-8 encoding of the U+9000 codepoint, a CJK Ideograph glyph.
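The U+9000 to U+9FFF range follows from the bit layout of a 3-byte UTF-8 sequence (1110xxxx 10xxxxxx 10xxxxxx). A small Python 3 sketch of that arithmetic, with variable names of my own choosing:

```python
lead = 0xE9  # 0b1110_1001

# A lead byte of the form 1110xxxx announces a 3-byte sequence.
is_three_byte_lead = (lead & 0xF0) == 0xE0

# The low 4 bits of the lead byte are the top 4 bits of the code point;
# the two continuation bytes supply 6 bits each, 12 bits in total.
top_bits = lead & 0x0F
lowest = top_bits << 12             # continuation bits all 0
highest = (top_bits << 12) | 0xFFF  # continuation bits all 1

# Cross-check against Python's own encoder.
encoded_lowest = chr(lowest).encode("utf-8")
encoded_highest = chr(highest).encode("utf-8")

print(hex(lowest), hex(highest))
```

Both ends of the range encode to three bytes starting with 0xE9, which is why that byte on its own is reported as incomplete data.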
If you want to understand the difference between encodings and Unicode, and how UTF-8 and other codecs work, I strongly recommend you read:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
The Python Unicode HOWTO
Pragmatic Unicode by Ned Batchelder