Although the title is phrased as a question, the short answer is apparently no; I tried it in the shell. The real question is why. P.S. Here string is some non-ASCII text, such as Chinese, and XXX is the current encoding of that string.
>>> u'中文' == '中文'.decode('gbk')
False
# The first one is u'\xd6\xd0\xce\xc4', while the second one is u'\u4e2d\u6587'
The example is above. I am using Windows (Simplified Chinese). The default encoding is GBK, and so is the Python shell's. Yet the two unicode objects compare unequal.
UPDATES
>>> a = '中文'.decode('gbk')
>>> a
u'\u4e2d\u6587'
>>> print a
中文
>>> b = u'中文'
>>> print b
ÖÐÎÄ
decode() is a method on str objects in Python 2. It converts a byte string from the encoding named by its argument into a unicode string. It works in the opposite direction to encode(): it takes the name of the encoding the bytes are in and returns the decoded text.
Decoding UTF-8 strings in Python: to decode a string encoded in UTF-8, use the decode() method on strings. The method accepts two arguments, encoding and errors. encoding names the encoding of the string to be decoded, and errors decides how to handle errors that arise during decoding.
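As a minimal sketch of those two arguments, here is the same idea in Python 3 syntax, where the method lives on bytes rather than str. The variable names are illustrative only and are not from the original posts:

```python
# Python 3 sketch: bytes.decode(encoding, errors) mirrors Python 2's str.decode.
raw = '\u4e2d\u6587'.encode('utf-8')   # the two characters from the question, as UTF-8 bytes

text = raw.decode('utf-8')             # correct codec: round-trips cleanly

truncated = raw[:-1]                   # drop the last byte to force a decoding error
try:
    truncated.decode('utf-8')          # errors='strict' is the default
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# Other error handlers trade an exception for lossy output.
replaced = truncated.decode('utf-8', errors='replace')  # bad bytes become U+FFFD
ignored = truncated.decode('utf-8', errors='ignore')    # bad bytes are dropped
```

With the default strict handler a damaged byte string raises UnicodeDecodeError; replace and ignore succeed but lose information, so they are best reserved for display or logging.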
In Python, the built-in functions chr() and ord() are used to convert between Unicode code points and characters. A character can also be represented by writing a hexadecimal Unicode code point with \x, \u, or \U in a string literal.
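A short sketch of those conversions, assuming Python 3 (in Python 2 the unicode counterpart of chr() is unichr()). The variable names are illustrative:

```python
# Python 3 sketch: ord() and chr() convert between characters and code points.
code_point = ord('\u4e2d')   # code point of the first character in the question
same_char = chr(0x4e2d)      # back from the code point to the character

# Three escape forms that spell a code point inside a string literal.
latin = '\xd6'               # \x takes exactly two hex digits (U+00D6)
bmp = '\u4e2d'               # \u takes exactly four hex digits
wide = '\U00004e2d'          # \U takes exactly eight hex digits
```

Note that '\xd6' in a text string is the single character U+00D6, not a raw byte, which is why a text string full of \x escapes can still be the wrong text.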
Yes, str.decode() usually returns a unicode string, if the codec can successfully decode the bytes. But the values only represent the same text if the correct codec is used.
Your sample text is not using the right codec; you have text that is GBK encoded, decoded as Latin1:
>>> print u'\u4e2d\u6587'
中文
>>> u'\u4e2d\u6587'.encode('gbk')
'\xd6\xd0\xce\xc4'
>>> u'\u4e2d\u6587'.encode('gbk').decode('latin1')
u'\xd6\xd0\xce\xc4'
The values are indeed not equal, because they are not the same text.
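The comparison above can be reproduced in Python 3 syntax as well. This is only a sketch of the same idea with illustrative variable names; it starts from the GBK bytes directly so that it does not depend on any terminal or source encoding:

```python
# Python 3 sketch: the same GBK bytes decoded with two different codecs.
gbk_bytes = b'\xd6\xd0\xce\xc4'

as_gbk = gbk_bytes.decode('gbk')         # the intended two-character text
as_latin1 = gbk_bytes.decode('latin1')   # one character per byte: four characters

same_text = (as_gbk == as_latin1)        # False: different code points, different text
```

Both decodes succeed, because Latin-1 assigns a character to every byte value. Success alone therefore says nothing about whether the codec was the right one.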
Again, it is important that you use the right codec; a different codec produces very different results:
>>> print u'\u4e2d\u6587'.encode('gbk').decode('latin1')
ÖÐÎÄ
Here I decoded the GBK-encoded sample text as Latin-1, not as GBK. Decoding may have succeeded, but the resulting text is not readable.
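Because Latin-1 maps every byte to exactly one code point, this particular kind of garbling is reversible: encoding the unreadable text back to Latin-1 recovers the original bytes, which can then be decoded with the right codec. A minimal Python 3 sketch, where fix_mojibake is a hypothetical helper name and not a library function:

```python
def fix_mojibake(garbled, wrong='latin1', right='gbk'):
    """Undo a decode that used the wrong codec (hypothetical helper).

    Only reliable when the wrong codec is lossless for every byte,
    as Latin-1 is; other codecs may already have lost information.
    """
    return garbled.encode(wrong).decode(right)


garbled = '\xd6\xd0\xce\xc4'        # GBK bytes that were decoded as Latin-1
recovered = fix_mojibake(garbled)   # the original two-character text
```

This repairs data that has already been mis-decoded; the better fix is still to decode with the correct codec in the first place.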
Note also that pasting non-ASCII characters only works because the Python interpreter has detected my terminal codec correctly. I can paste text from my browser into my terminal, which then passes the text to Python as UTF-8-encoded data. Because Python has asked the terminal what codec was used, it was able to decode the bytes of the u'....' Unicode literal back into text. When printing the encoded.decode('utf8') unicode result, Python once more auto-encodes the data to fit my terminal encoding.
To see what codec Python detected, print sys.stdin.encoding:
>>> import sys
>>> sys.stdin.encoding
'UTF-8'
Similar decisions have to be made when dealing with different sources of text. Reading string literals from the source file, for example, requires that you either use ASCII only (and use escape codes for everything else), or provide Python with an explicit codec notation at the top of the file.
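The explicit codec notation is the PEP 263 coding declaration, a comment such as # -*- coding: gbk -*- on the first or second line of the file. The sketch below uses Python 3 and stands in for a real file with an in-memory byte string; the names source, namespace and decoded_literal are illustrative. Python 3 assumes UTF-8 when there is no declaration, while Python 2 assumed ASCII.

```python
# Python 3 sketch: a coding declaration tells Python how to decode source bytes.
# The "file" here is a byte string holding GBK-encoded source code.
source = '# -*- coding: gbk -*-\ntext = "\u4e2d\u6587"\n'.encode('gbk')

namespace = {}
# compile() accepts bytes and honours the declaration when decoding them.
exec(compile(source, '<gbk-demo>', 'exec'), namespace)

decoded_literal = namespace['text']   # the literal, decoded as GBK
```

The same bytes without the declaration would be read with the default source encoding instead, so the literal would not come out as the intended text.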
I urge you to read:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
The Python Unicode HOWTO
Pragmatic Unicode by Ned Batchelder
to gain a more complete understanding of how Unicode works, and how Python handles Unicode.