I'm simply trying to decode a \uXXXX\uXXXX\uXXXX-like string, but I get an error:
$ python
Python 2.7.6 (default, Sep 9 2014, 15:04:36)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.39)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> print u'\u041e\u043b\u044c\u0433\u0430'.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-4: ordinal not in range(128)
I'm a Python newbie. What's the problem? Thanks!
Some codecs, such as ASCII, map only a limited set of Unicode characters to bytes. Any character the target codec cannot represent makes the encoding fail and raise UnicodeEncodeError. To avoid this error, call encode('utf-8') on unicode strings and decode('utf-8') on byte strings explicitly, rather than relying on the implicit ASCII conversion.
If you are seeing the "ordinal not in range(128)" error, it is because you are converting unicode to encoded bytes with str(), which uses the ASCII codec. To solve the problem, stop calling str() and use .encode() to encode the string explicitly. Syntax: str.encode(encoding="utf-8", errors="strict") (in Python 2, call it on a unicode object).
A UnicodeEncodeError normally happens when encoding a unicode string into a particular encoding. Since many encodings map only a limited number of Unicode characters to str byte strings, a character the encoding cannot represent causes that encoding's encode() to fail.
Python is trying to be helpful. You cannot decode Unicode data, it is already decoded. So Python first will encode the data (using the ASCII codec) to get bytes to decode. It is this implicit encoding that fails.
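The implicit step can be reproduced explicitly. This is a minimal sketch, assuming the same literal as in the question; the names text and error are illustrative only. It performs the ASCII encode that Python 2 runs behind the scenes before attempting the decode, and captures the resulting exception:

```python
# Sketch: reproduce the hidden ASCII encode that precedes unicode.decode() in Python 2.
# `text` and `error` are illustrative names, not part of the original question.
text = u'\u041e\u043b\u044c\u0433\u0430'

error = None
try:
    # Python 2 runs the equivalent of this step before it can decode anything.
    text.encode('ascii')
except UnicodeEncodeError as exc:
    error = exc

# Show the captured exception; it names the 'ascii' codec, as in the traceback.
print(repr(error))
```

The captured exception should report the same codec and character range as the traceback in the question, which suggests the failure comes from the encode step and not from UTF-8 decoding itself.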
If you have Unicode data, it only makes sense to encode to UTF-8, not decode:
>>> print u'\u041e\u043b\u044c\u0433\u0430'
Ольга
>>> u'\u041e\u043b\u044c\u0433\u0430'.encode('utf8')
'\xd0\x9e\xd0\xbb\xd1\x8c\xd0\xb3\xd0\xb0'
If you wanted a Unicode value, then using a Unicode literal (u'...') is all you needed to do. No further decoding is necessary.
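A different situation, which the question does not confirm, is that the \uXXXX sequences arrive as literal backslash text, for example read from a file or a network response. In that case there really are bytes to decode. The following is a sketch under that assumption; raw is a hypothetical input, and unicode_escape is the standard codec for this kind of escaped text:

```python
# Sketch: decode literal backslash-u escapes stored in a byte string.
# `raw` is a hypothetical input; the question does not say where the data comes from.
raw = b'\\u041e\\u043b\\u044c\\u0433\\u0430'

# unicode_escape turns each \uXXXX escape into the corresponding character.
text = raw.decode('unicode_escape')

# Compare against the Unicode literal from the question.
print(text == u'\u041e\u043b\u044c\u0433\u0430')
```

Here the decode call is applied to bytes, so no implicit ASCII encode is involved.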
The same implicit conversion takes place in the other direction; if you tried to encode a bytestring you'd trigger an implicit decoding:
>>> u'\u041e\u043b\u044c\u0433\u0430'.encode('utf8').encode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 0: ordinal not in range(128)
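Keeping the two directions straight avoids both implicit conversions. A minimal sketch of the round trip, with illustrative variable names: encode Unicode text to get bytes, and decode bytes to get Unicode text back.

```python
# Sketch: explicit round trip between Unicode text and UTF-8 bytes.
# Variable names are illustrative only.
text = u'\u041e\u043b\u044c\u0433\u0430'

# Unicode text -> bytes: encode.
data = text.encode('utf-8')

# Bytes -> Unicode text: decode.
restored = data.decode('utf-8')

# The round trip should give back the original text.
print(restored == text)
```

Each call is made on the type it is meant for, so Python never has to guess a codec.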
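You can also set the default encoding to UTF-8, which changes the codec Python 2 uses for implicit conversions across the whole process: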
import sys
reload(sys)
sys.setdefaultencoding('utf-8')