I have been experimenting with Python 2.x and Unicode for a while, but I've reached a point where things no longer make sense to me.
First problem:
Some code will make clear what I mean. The txt variable is here to simulate the PyQt4 translate function, which returns a QString.
# -*- coding: utf-8 -*-
from PyQt4 import QtCore
txt = QtCore.QString(u'può essere / sarà / 日本語')
txtUnicode1 = unicode(txt, errors='replace')
txtUnicode2 = unicode(txt)
When printing the two unicode strings, I get:
pu� essere / sar� / ???
può essere / sarà / 日本語
Surely I could get the same result by using QString.__str__(), but my point is understanding what is going on, so for the sake of argument I would like to know why unicode(txt, errors='replace') and unicode(txt) give such different results.
Second problem:
I am trying to understand encoding/decoding.
>>> a = u'può essere / sarà / 日本'
>>> b = a.encode('utf-8')
>>> a
u'pu\xf2 essere / sar\xe0 / \u65e5\u672c'
>>> b
'pu\xc3\xb2 essere / sar\xc3\xa0 / \xe6\x97\xa5\xe6\x9c\xac'
>>> print a
può essere / sarà / 日本
>>> print b
può essere / sarà / 日本
Python 2 uses the str type to store bytes and the unicode type to store Unicode code points. All string literals are str (bytes) by default, and the default encoding is ASCII. So if an incoming file contains Cyrillic characters, Python 2 may fail, because ASCII cannot represent them.
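As a minimal sketch of this point: the same byte string decoded with the ASCII codec fails, while an explicit UTF-8 decode succeeds. (The Cyrillic sample text here is my own illustration, not from the thread; the code runs under both Python 2 and 3, where in Python 2 b'...' is str and the result of .decode() is unicode.)

```python
# -*- coding: utf-8 -*-
# "Привет" encoded as UTF-8: 6 code points, 12 bytes
data = b'\xd0\x9f\xd1\x80\xd0\xb8\xd0\xb2\xd0\xb5\xd1\x82'

try:
    data.decode('ascii')          # ASCII is Python 2's default codec
except UnicodeDecodeError:
    print('ASCII cannot decode bytes above 0x7f')

text = data.decode('utf-8')       # an explicit codec succeeds
print(len(text))                  # 6 code points, not 12 bytes
```

This is why "it works on my machine" bugs are so common in Python 2: the implicit ASCII decode only blows up once non-ASCII data actually arrives.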
Python's unicode type uses the Unicode standard for representing characters, which lets Python programs work with all of these different possible characters.
A UnicodeEncodeError normally happens when encoding a unicode string into a particular encoding. Since most encodings map only a limited subset of Unicode characters to byte strings, a character that is not present in the target encoding will cause that encoding's encode() to fail.
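A quick sketch of that failure mode, using the string from the question (latin-1 as the narrow target encoding is my own choice for illustration; it covers ò and à but has no mapping for the CJK characters, while UTF-8 can encode every code point):

```python
# -*- coding: utf-8 -*-
a = u'pu\xf2 essere / sar\xe0 / \u65e5\u672c'

try:
    a.encode('latin-1')           # latin-1 has no mapping for 日本
except UnicodeEncodeError:
    print('latin-1 cannot encode the CJK characters')

b = a.encode('utf-8')             # UTF-8 always succeeds
```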
Let's fire up the old standby, IDLE, and see if we can replicate what you're seeing.
IDLE 1.1.4
>>> a = u'può essere / sarà / 日本'
Unsupported characters in input
>>> a = u'pu\xf2 essere / sar\xe0 / \u65e5\u672c'
>>> b = a.encode('utf-8')
>>> a
u'pu\xf2 essere / sar\xe0 / \u65e5\u672c'
>>> b
'pu\xc3\xb2 essere / sar\xc3\xa0 / \xe6\x97\xa5\xe6\x9c\xac'
>>> print a
può essere / sarà / 日本
>>> print b
può essere / sarà / 日本
Note that I see something different when I print b. This is because my shell (IDLE) does not interpret a sequence of bytes as UTF-8 text, but rather uses my platform character encoding, cp1252.
Let's just double check this.
>>> import sys
>>> sys.stdout.encoding
'cp1252'
Yup, that's why I get different behavior than you do: your sys.stdout.encoding is UTF-8. And that is why, despite a and b being completely different values, they display the same; your terminal interprets the bytes as UTF-8.
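We can reproduce both terminals' behavior without either terminal, by decoding the UTF-8 bytes with each codec explicitly (a sketch of my own, not from the original session):

```python
# -*- coding: utf-8 -*-
a = u'pu\xf2 essere / sar\xe0 / \u65e5\u672c'
b = a.encode('utf-8')

# Your UTF-8 terminal effectively does this, recovering the original text:
assert b.decode('utf-8') == a

# My cp1252 shell effectively does this: every byte happens to map to
# *some* cp1252 character, so no error is raised -- just mojibake.
mojibake = b.decode('cp1252')
print(mojibake)                   # puÃ² essere / sarÃ  ...
```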
So you might be wondering if we can convert our sequence of unicode characters a into a sequence of bytes that can be displayed in IDLE.
>>> c = a.encode('cp1252')
Traceback (most recent call last):
File "<pyshell#19>", line 1, in -toplevel-
c = a.encode('cp1252')
File "C:\Python24\lib\encodings\cp1252.py", line 18, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode characters in position 20-21: character maps to <undefined>
The answer is no; cp1252 does not support encoding Chinese characters as bytes.
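If we insist on producing cp1252 bytes anyway, the errors argument of encode() controls what happens to the unmappable characters instead of raising UnicodeEncodeError. A hedged sketch (the choice of 'replace' and 'ignore' here is mine; the thread does not use them):

```python
# -*- coding: utf-8 -*-
a = u'pu\xf2 essere / sar\xe0 / \u65e5\u672c'

replaced = a.encode('cp1252', 'replace')   # unmappable characters become '?'
ignored = a.encode('cp1252', 'ignore')     # unmappable characters are dropped
print(replaced)                            # ends with b'??'
print(ignored)                             # the CJK characters silently vanish
```

Both options lose information, of course; the only lossless choice for arbitrary text is an encoding like UTF-8 that covers all of Unicode.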