
Can't decode utf-8 string in python on os x terminal.app

I have Terminal.app set to accept UTF-8, and in bash I can type, copy, and paste Unicode characters. In the Python shell I can't, and if I try to decode Unicode I get errors:

>>> wtf = u'\xe4\xf6\xfc'.decode()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
>>> wtf = u'\xe4\xf6\xfc'.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)

Anyone know what I'm doing wrong?

Bjorn asked Nov 27 '22 19:11



1 Answer

I think there is encode/decode confusion all over the place. You start with a unicode object:

u'\xe4\xf6\xfc'

This is a unicode object; the three characters are the Unicode code points for "äöü". If you want to turn them into UTF-8, you have to encode them:

>>> u'\xe4\xf6\xfc'.encode('utf-8')
'\xc3\xa4\xc3\xb6\xc3\xbc'

The resulting six bytes are the UTF-8 representation of "äöü".
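A minimal sketch of the round trip, in case it helps to see both directions side by side. It assumes Python 2.6+ or 3.3+ (the b'...' literal is not available in the 2.5 interpreter from the question), and the variable names are only illustrative:

```python
# Assumption: Python 2.6+ or 3.3+; names are illustrative only.
text = u'\xe4\xf6\xfc'              # unicode object: three code points
encoded = text.encode('utf-8')      # byte string: six bytes
decoded = encoded.decode('utf-8')   # back to the original unicode object

print(len(text), len(encoded), decoded == text)
```

encode goes from unicode to bytes, decode goes from bytes to unicode; the two only make sense in that pairing.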

If you call decode(...), you ask Python to interpret a byte string in some encoding and convert it to unicode. Since the object already is unicode, this doesn't work: Python 2 first implicitly encodes it to a byte string with the default ASCII codec, and because "äöü" is not ASCII, that step fails in both of your calls before any ASCII or UTF-8 decoding happens. That is why the tracebacks show a UnicodeEncodeError rather than a UnicodeDecodeError.
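The error from the question's tracebacks can be reproduced by encoding the same unicode object to ASCII explicitly, which is what Python 2 does implicitly before a decode on a unicode object. A small sketch, assuming Python 2.6+ or 3.3+ (for the `except ... as` syntax); `message` is just an illustrative name:

```python
# Assumption: Python 2.6+ or 3.3+; this performs the ASCII encode step explicitly.
text = u'\xe4\xf6\xfc'
message = None
try:
    text.encode('ascii')            # none of the three characters is ASCII
except UnicodeEncodeError as exc:
    message = str(exc)              # same wording as in the tracebacks above

print(message)
```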

Further confusion might come from the fact that '\xe4\xf6\xfc' is also the Latin-1/ISO-8859-1 encoding of "äöü". If you write a normal Python string (without the leading "u" that marks it as unicode), you can convert it to a unicode object with decode('latin1'):

>>> '\xe4\xf6\xfc'.decode('latin1')
u'\xe4\xf6\xfc'
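To make the Latin-1 versus UTF-8 distinction concrete, here is a hedged sketch that decodes the same three bytes both ways. It assumes Python 2.6+ or 3.3+ for the b'...' literal, and the names `raw`, `as_latin1` and `utf8_ok` are made up for the example:

```python
# Assumption: Python 2.6+ or 3.3+; names are illustrative only.
raw = b'\xe4\xf6\xfc'               # three bytes, valid Latin-1 but not valid UTF-8

as_latin1 = raw.decode('latin1')    # each byte maps to one code point

try:
    raw.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False                 # 0xe4 starts a multi-byte sequence that 0xf6 cannot continue

print(as_latin1 == u'\xe4\xf6\xfc', utf8_ok)
```

The same bytes mean different things (or nothing valid at all) depending on the codec, so the encoding passed to decode has to match how the bytes were produced.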
sth answered Feb 12 '23 09:02
