>>> a = "我" # chinese
>>> b = unicode(a,"gb2312")
>>> a.__class__
<type 'str'>
>>> b.__class__
<type 'unicode'> # b is unicode
>>> a
'\xce\xd2'
>>> b
u'\u6211'
>>> c = u"我"
>>> c.__class__
<type 'unicode'> # c is unicode
>>> c
u'\xce\xd2'
b and c are both unicode, but >>> b outputs u'\u6211' and >>> c outputs u'\xce\xd2'. Why?
Encoding is the process of transforming a sequence of Unicode characters into a sequence of bytes. Decoding is the reverse: transforming a sequence of encoded bytes back into Unicode characters. The Unicode Standard assigns a code point (a number) to each character in every supported script.
In Python 2, decode() is a method of the str type, which holds raw bytes. It takes the name of the encoding those bytes are in and returns the corresponding unicode object. It is the inverse of encode(), which turns a unicode object into a byte string in the named encoding.
Unicode is a universal character set used to process, store, and interchange text in any language; its characters are stored as bytes using encodings such as UTF-8. ASCII is an older character encoding standard for electronic communication that covers only 128 characters: English letters, digits, punctuation, and control codes.
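As a rough illustration of the encode/decode relationship described above, the sketch below takes the character from the question through GB2312 and UTF-8 and back. The variable names are made up for the example, and the character is written as an escape so the result does not depend on the terminal or source encoding.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

# U+6211 is the character 我, written as an escape on purpose.
text = u"\u6211"

# Encoding: Unicode text -> bytes in a specific encoding.
gb_bytes = text.encode("gb2312")
utf8_bytes = text.encode("utf-8")

# Decoding: bytes -> Unicode text, using the SAME encoding that produced them.
from_gb = gb_bytes.decode("gb2312")
from_utf8 = utf8_bytes.decode("utf-8")

print(gb_bytes == b"\xce\xd2")        # True: the bytes seen in variable a
print(utf8_bytes == b"\xe6\x88\x91")  # True: the bytes on a UTF-8 system
print(from_gb == from_utf8 == text)   # True: both round trips give U+6211
```

The same code point comes back from both byte sequences, which is why b in the question shows u'\u6211' once the right encoding name is passed.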
When you enter "我", the Python interpreter gets from the terminal a representation of that character in your local character set, which it stores in a string byte-for-byte because of the "". On my UTF-8 system, that's '\xe6\x88\x91'. On yours, it's '\xce\xd2' because you use GB2312. That explains the value of your variable a.
When you enter u"我", the Python interpreter doesn't know which encoding the 我 character is in. What it does is pretty much the same as for an ordinary string: it stores the bytes of the character in a Unicode string, interpreting each byte as a Unicode codepoint, hence the wrong result u'\xce\xd2' (or, on my box, u'\xe6\x88\x91').
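The byte-per-codepoint behaviour can be imitated outside the interactive interpreter. The sketch below is only an emulation, not the interpreter's actual code path: decoding as Latin-1 maps every byte to the code point with the same number, which gives the same wrong value as c. The names raw, wrong, right, and repaired are invented for the example.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

# The two bytes a GB2312 terminal sends for 我.
raw = b"\xce\xd2"

# Latin-1 maps each byte to the code point of the same value,
# imitating "interpret each byte as a Unicode codepoint".
wrong = raw.decode("latin-1")

# Decoding with the terminal's real encoding gives the intended character.
right = raw.decode("gb2312")

print(wrong == u"\xce\xd2")    # True: two unrelated characters, like c
print(right == u"\u6211")      # True: one character, like b
print(len(wrong), len(right))  # 2 1

# A mangled value like c can be repaired by reversing the bad step.
repaired = wrong.encode("latin-1").decode("gb2312")
print(repaired == right)       # True
```

The repair on the last lines works only because Latin-1 is a lossless byte-to-codepoint mapping; it assumes you know the encoding the terminal really used.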
This problem only exists in the interactive interpreter. When you write Python scripts or modules, you can specify the encoding near the top and Unicode strings will come out right. E.g., on my system, the following prints the word liberté twice:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
print(u"liberté")
print("liberté")
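To see what the coding declaration changes, a small check like the following can help. It assumes the file is saved as UTF-8, matching the declaration, and that é is stored as the single precomposed character U+00E9; the variable names are illustrative.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

# With the declaration above, the literal is decoded as UTF-8 when the file is read.
word = u"liberté"

# Encoding back to UTF-8 gives the bytes as they are stored in the file.
encoded = word.encode("utf-8")

print(len(word))     # 7: one entry per character
print(len(encoded))  # 8: é occupies two bytes in UTF-8
```

If the declared encoding did not match how the file was actually saved, the Unicode literal would be decoded incorrectly, much like c in the question.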