Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

latin-1 vs unicode in python

I was reading this high rated post in SO on unicodes

Here is an `illustration given there :

$ python

>>> import sys

>>> print sys.stdout.encoding
UTF-8

>>> print '\xe9' # (1)
é
>>> print u'\xe9' # (2)
é
>>> print u'\xe9'.encode('latin-1') # (3)
é
>>>

and the explanation were given as

(1) python outputs binary string as is, terminal receives it and tries to match its value with latin-1 character map. In latin-1, 0xe9 or 233 yields the character "é" and so that's what the terminal displays.

My question is: why does the terminal match to the latin-1 character map when the encoding is 'UTF-8'?

Also when I tried

>>> print '\xe9'
?
>>> print u'\xe9'
é

I get different result for the first one than what is described above. why is this discrepancy and where does latin-1 come to play in this picture?

like image 669
eagertoLearn Avatar asked Feb 19 '14 19:02

eagertoLearn


People also ask

Why do we use encoding Latin-1 in Python?

The latin-1 encoding in Python implements ISO_8859-1:1987 which maps all possible byte values to the first 256 Unicode code points, and thus ensures decoding errors will never occur regardless of the configured error handler.

What is Unicode in Python?

To summarize the previous section: a Unicode string is a sequence of code points, which are numbers from 0 through 0x10FFFF (1,114,111 decimal). This sequence of code points needs to be represented in memory as a set of code units, and code units are then mapped to 8-bit bytes.

Is Latin-1 a subset of UTF-8?

The Latin-1 characters in the range 128-255 are not valid within a UTF-8 context. Although they do share the same character codes, in UTF-8 they are represented differently.

Is Python Unicode or ASCII?

Python 2 uses str type to store bytes and unicode type to store unicode code points. All strings by default are str type — which is bytes~ And Default encoding is ASCII. So if an incoming file is Cyrillic characters, Python 2 might fail because ASCII will not be able to handle those Cyrillic Characters.


1 Answers

You are missing some important context; in that case the OP configured the terminal emulator (Gnome Terminal) to interpret output as Latin-1 but left the shell variables set to UTF-8. Python thus is told by the shell to use UTF-8 for Unicode output but the actual configuration of the terminal is to expect Latin-1 bytes.

The print output clearly shows the terminal is interpreting output using Latin-1, and is not using UTF-8.

When a terminal is set to UTF-8, the \xe9 byte is not valid (incomplete) UTF-8 and your terminal usually prints a question mark instead:

>>> import sys
>>> sys.stdout.encoding
'UTF-8'
>>> print '\xe9'
?
>>> print u'\xe9'
é
>>> print u'\xe9'.encode('utf8')
é

If you instruct Python to ignore such errors, it gives you the U+FFFD REPLACEMENT CHARACTER glyph instead:

>>> '\xe9'.decode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/mj/Development/venvs/stackoverflow-2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe9 in position 0: unexpected end of data
>>> '\xe9'.decode('utf8', 'replace')
u'\ufffd'
>>> print '\xe9'.decode('utf8', 'replace')
�

That's because in UTF-8, \xe9 is the start byte of a 3-byte encoding, for the Unicode codepoints U+9000 through to U+9FFF, and if printed as just a single byte is invalid. This works:

>>> print '\xe9\x80\x80'
退

because that's the UTF-8 encoding of the U+9000 codepoint, a CJK Ideograph glyph.

If you want to understand the difference between encodings and Unicode, and how UTF-8 and other codecs work, I strongly recommend you read:

  • The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky

  • The Python Unicode HOWTO

  • Pragmatic Unicode by Ned Batchelder

like image 71
Martijn Pieters Avatar answered Oct 19 '22 23:10

Martijn Pieters