I'm having a few issues trying to encode a string to UTF-8. I've tried numerous things, including using string.encode('utf-8')
and unicode(string)
, but I get the error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 1: ordinal not in range(128)
This is my string:
(。・ω・。)ノ
I don't see what's going wrong, any idea?
Edit: The problem is that printing the string as it is does not show properly. Also, this error when I try to convert it:
Python 2.7.1+ (r271:86832, Apr 11 2011, 18:13:53) [GCC 4.5.2] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> s = '(\xef\xbd\xa1\xef\xbd\xa5\xcf\x89\xef\xbd\xa5\xef\xbd\xa1)\xef\xbe\x89' >>> s1 = s.decode('utf-8') >>> print s1 Traceback (most recent call last): File "<stdin>", line 1, in <module> UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-5: ordinal not in range(128)
The UnicodeDecodeError normally happens when decoding an str string from a certain coding. Since codings map only a limited number of str strings to unicode characters, an illegal sequence of str characters will cause the coding-specific decode() to fail.
This is to do with the encoding of your terminal not being set to UTF-8. Here is my terminal
$ echo $LANG en_GB.UTF-8 $ python Python 2.7.3 (default, Apr 20 2012, 22:39:59) [GCC 4.6.3] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> s = '(\xef\xbd\xa1\xef\xbd\xa5\xcf\x89\xef\xbd\xa5\xef\xbd\xa1)\xef\xbe\x89' >>> s1 = s.decode('utf-8') >>> print s1 (。・ω・。)ノ >>>
On my terminal the example works with the above, but if I get rid of the LANG
setting then it won't work
$ unset LANG $ python Python 2.7.3 (default, Apr 20 2012, 22:39:59) [GCC 4.6.3] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> s = '(\xef\xbd\xa1\xef\xbd\xa5\xcf\x89\xef\xbd\xa5\xef\xbd\xa1)\xef\xbe\x89' >>> s1 = s.decode('utf-8') >>> print s1 Traceback (most recent call last): File "<stdin>", line 1, in <module> UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-5: ordinal not in range(128) >>>
Consult the docs for your linux variant to discover how to make this change permanent.
try:
string.decode('utf-8') # or: unicode(string, 'utf-8')
edit:
'(\xef\xbd\xa1\xef\xbd\xa5\xcf\x89\xef\xbd\xa5\xef\xbd\xa1)\xef\xbe\x89'.decode('utf-8')
gives u'(\uff61\uff65\u03c9\uff65\uff61)\uff89'
, which is correct.
so your problem must be at some oter place, possibly if you try to do something with it were there is an implicit conversion going on (could be printing, writing to a stream...)
to say more we'll need to see some code.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With