latin-1 vs unicode in python

I was reading this highly rated post on SO about Unicode.

Here is an illustration given there:

$ python
>>> import sys
>>> print sys.stdout.encoding
UTF-8
>>> print '\xe9' # (1)
é
>>> print u'\xe9' # (2)
Ã©
>>> print u'\xe9'.encode('latin-1') # (3)
é

and the explanation was given as:

(1) python outputs binary string as is, terminal receives it and tries to match its value with latin-1 character map. In latin-1, 0xe9 or 233 yields the character "é" and so that's what the terminal displays.

My question is: why does the terminal match to the latin-1 character map when the encoding is 'UTF-8'?

Also, when I tried this myself:

>>> print '\xe9'
?
>>> print u'\xe9'
é

I get a different result for the first one than what is described above. Why is there a discrepancy, and where does latin-1 come into play in this picture?

asked Feb 19 '14 19:02

eagertoLearn

1 Answer

You are missing some important context; in that case the OP configured the terminal emulator (Gnome Terminal) to interpret output as Latin-1 but left the shell variables set to UTF-8. Python thus is told by the shell to use UTF-8 for Unicode output but the actual configuration of the terminal is to expect Latin-1 bytes.

The print output clearly shows the terminal is interpreting output using Latin-1, and is not using UTF-8.
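You can reproduce that mismatch without changing any terminal settings. What follows is only a rough sketch, written in Python 3 syntax (the sessions in this post are Python 2), and it simulates a Latin-1 terminal by decoding the written bytes as Latin-1. The helper name `seen_by_latin1_terminal` is made up for illustration:

```python
def seen_by_latin1_terminal(raw_bytes):
    """Hypothetical helper: the text a terminal set to Latin-1 would display."""
    return raw_bytes.decode('latin-1')

# (1) the raw byte 0xE9 is written as is
case_1 = seen_by_latin1_terminal(b'\xe9')
# (2) Python encodes U+00E9 to UTF-8 first, producing two bytes: 0xC3 0xA9
case_2 = seen_by_latin1_terminal('\u00e9'.encode('utf-8'))
# (3) explicitly encoded to Latin-1, producing the single byte 0xE9
case_3 = seen_by_latin1_terminal('\u00e9'.encode('latin-1'))

assert case_1 == '\u00e9'        # one character: é
assert case_2 == '\u00c3\u00a9'  # two characters: Ã©
assert case_3 == '\u00e9'        # one character: é
```

Cases (1) and (3) put the same single byte on the wire, which is why they look identical in the quoted session, while case (2) shows each of the two UTF-8 bytes as its own Latin-1 character.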

When a terminal is set to UTF-8, the \xe9 byte is not valid (incomplete) UTF-8 and your terminal usually prints a question mark instead:

>>> import sys
>>> sys.stdout.encoding
'UTF-8'
>>> print '\xe9'
?
>>> print u'\xe9'
é
>>> print u'\xe9'.encode('utf8')
é

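If you decode that byte as UTF-8 yourself, Python raises an error; if you instruct it to replace undecodable bytes rather than raise (the 'replace' error handler), it gives you the U+FFFD REPLACEMENT CHARACTER glyph � instead: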

>>> '\xe9'.decode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/mj/Development/venvs/stackoverflow-2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe9 in position 0: unexpected end of data
>>> '\xe9'.decode('utf8', 'replace')
u'\ufffd'
>>> print '\xe9'.decode('utf8', 'replace')
�
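For comparison, here is a small sketch of how the standard error handlers treat the same lone byte. It uses Python 3 syntax (an assumption on my part, since the session above is Python 2), and the variable names are just for illustration:

```python
raw = b'\xe9'  # a lone UTF-8 lead byte, as in the session above

# The default handler is 'strict', which raises on invalid input.
try:
    raw.decode('utf-8')
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

replaced = raw.decode('utf-8', 'replace')  # substitutes U+FFFD for the bad byte
ignored = raw.decode('utf-8', 'ignore')    # silently drops the bad byte

assert strict_failed
assert replaced == '\ufffd'
assert ignored == ''
```

Note that 'ignore' and 'replace' are different handlers: only 'replace' leaves a visible marker where the invalid byte was.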

That's because in UTF-8, \xe9 is the lead byte of a 3-byte sequence, covering the Unicode code points U+9000 through U+9FFF; printed on its own as a single byte, it is invalid. This works:

>>> print '\xe9\x80\x80'
退

because that's the UTF-8 encoding of the U+9000 codepoint, a CJK Ideograph glyph.
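If you want to check the U+9000 to U+9FFF range yourself, the bit arithmetic can be sketched as below. This is an illustration only, in Python 3 syntax, handling just the 3-byte form; `decode_3byte_utf8` is a hypothetical helper, not a library function:

```python
def decode_3byte_utf8(seq):
    """Hypothetical helper: decode one 3-byte UTF-8 sequence by hand."""
    b0, b1, b2 = seq  # iterating over bytes yields integers in Python 3
    # Bit layout of a 3-byte sequence: 1110xxxx 10xxxxxx 10xxxxxx
    assert b0 & 0xF0 == 0xE0, "not a 3-byte lead byte"
    assert b1 & 0xC0 == 0x80 and b2 & 0xC0 == 0x80, "bad continuation byte"
    # Keep only the payload bits (the x's) and join them together.
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)

# Smallest and largest continuation bytes after the 0xE9 lead byte
lowest = decode_3byte_utf8(b'\xe9\x80\x80')
highest = decode_3byte_utf8(b'\xe9\xbf\xbf')

assert lowest == 0x9000
assert highest == 0x9FFF
# The built-in codec agrees with the manual calculation.
assert chr(lowest) == b'\xe9\x80\x80'.decode('utf-8')
```

The lead byte 0xE9 is 1110 1001 in binary, so its four payload bits (1001, or 9) become the top hex digit of the code point, which is where the 9xxx range comes from.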

If you want to understand the difference between encodings and Unicode, and how UTF-8 and other codecs work, I strongly recommend you read:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
The Python Unicode HOWTO
Pragmatic Unicode by Ned Batchelder

answered Oct 19 '22 23:10

Martijn Pieters

Tags: python, unicode, utf-8, latin1