I have Terminal.app set to accept UTF-8, and in bash I can type, copy, and paste Unicode characters, but if I start the Python shell I can't, and if I try to decode Unicode I get errors:
>>> wtf = u'\xe4\xf6\xfc'.decode()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
>>> wtf = u'\xe4\xf6\xfc'.decode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
Anyone know what I'm doing wrong?
I think there is encode/decode confusion all over the place. You start with a unicode object:
u'\xe4\xf6\xfc'
This is a unicode object; the three characters are the Unicode code points for "äöü". If you want to turn them into UTF-8, you have to encode them:
>>> u'\xe4\xf6\xfc'.encode('utf-8')
'\xc3\xa4\xc3\xb6\xc3\xbc'
The resulting six bytes are the UTF-8 representation of "äöü".
If you call decode(...), you ask Python to interpret the characters as some encoding that still needs to be converted to Unicode. Since the object already is Unicode, this doesn't work. Your first call attempts an ASCII-to-Unicode conversion, the second a UTF-8-to-Unicode conversion. Since u'\xe4\xf6\xfc' is neither valid ASCII nor valid UTF-8, these conversion attempts fail. (The reason you see a UnicodeEncodeError rather than a UnicodeDecodeError is that Python 2 first implicitly encodes the unicode object to a byte string using the default ASCII codec before it can decode it, and that implicit encode step is what blows up.)
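To see the difference between the two directions, here is a minimal sketch. It uses the b'' bytes-literal prefix (available from Python 2.6 onward, and required in Python 3) so it behaves the same on both, even though the original question is about Python 2.5:

```python
# A unicode string and its UTF-8 encoding round-trip cleanly:
# encode() goes from Unicode to bytes, decode() from bytes to Unicode.
text = u'\xe4\xf6\xfc'           # "äöü" as three Unicode code points
utf8 = text.encode('utf-8')      # six bytes: b'\xc3\xa4\xc3\xb6\xc3\xbc'
assert utf8 == b'\xc3\xa4\xc3\xb6\xc3\xbc'
assert utf8.decode('utf-8') == text

# Decoding the raw Latin-1 bytes as UTF-8 fails, because
# b'\xe4\xf6\xfc' is not a valid UTF-8 byte sequence.
try:
    b'\xe4\xf6\xfc'.decode('utf-8')
except UnicodeDecodeError as exc:
    print('not valid UTF-8:', exc)
```

The rule of thumb: encode() always goes from Unicode to bytes, decode() always goes from bytes to Unicode, and the codec name says which byte representation is on the bytes side.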
Further confusion might come from the fact that '\xe4\xf6\xfc' is also the Latin-1/ISO-8859-1 encoding of "äöü". If you write a normal Python string (without the leading "u" that marks it as Unicode), you can convert it to a unicode object with decode('latin1'):
>>> '\xe4\xf6\xfc'.decode('latin1')
u'\xe4\xf6\xfc'
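Putting it together, converting a byte string from one encoding to another always goes through Unicode as the intermediate form. A sketch (again using b''/u'' prefixes so the behavior is the same on Python 2.6+ and Python 3):

```python
# Transcode Latin-1 bytes to UTF-8 bytes via a unicode object.
latin1_bytes = b'\xe4\xf6\xfc'            # "äöü" encoded as Latin-1
text = latin1_bytes.decode('latin1')      # -> u'\xe4\xf6\xfc' (Unicode)
utf8_bytes = text.encode('utf-8')         # -> b'\xc3\xa4\xc3\xb6\xc3\xbc'

# The round trip back through UTF-8 recovers the same unicode object.
assert utf8_bytes.decode('utf-8') == text
print(repr(utf8_bytes))
```

This decode-then-encode pattern is the general recipe whenever the terminal, a file, and your source code disagree about encodings.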