
Is u'string' the same as 'string'.decode('XXX')?

Although the title is a question, the short answer is apparently no; I tried it in the shell. The real question is why. PS: the string contains non-ASCII characters such as Chinese, and XXX is the current encoding of the string.

>>> u'中文' == '中文'.decode('gbk')
False
# The first one is u'\xd6\xd0\xce\xc4', while the second one is u'\u4e2d\u6587'

The example is above. I am using Windows in Simplified Chinese. The default encoding is GBK, and so is that of the Python shell. Yet the two unicode objects compare unequal.

UPDATES

>>> a = '中文'.decode('gbk')
>>> a
u'\u4e2d\u6587'
>>> print a
中文

>>> b = u'中文'
>>> print b
ÖÐÎÄ
Joey.Z asked Jan 07 '14 14:01



1 Answer

Yes, str.decode() returns a unicode string if the codec can successfully decode the bytes. But the two values only represent the same text if the correct codec is used.

Your u'中文' literal was not decoded with the right codec; its bytes are GBK-encoded, but they were decoded as Latin-1:

>>> print u'\u4e2d\u6587'
中文
>>> u'\u4e2d\u6587'.encode('gbk')
'\xd6\xd0\xce\xc4'
>>> u'\u4e2d\u6587'.encode('gbk').decode('latin1')
u'\xd6\xd0\xce\xc4'

The values are indeed not equal, because they are not the same text.

Again, it is important that you use the right codec; a different codec will produce very different results:

>>> print u'\u4e2d\u6587'.encode('gbk').decode('latin1')
ÖÐÎÄ

I decoded the GBK-encoded sample text as Latin-1, not as GBK. Decoding succeeded, but the resulting text is not readable.
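The same mismatch can be reproduced without relying on any terminal or source-file encoding. The following is a minimal sketch that uses only escape sequences and explicit b''/u'' prefixes, so it should behave the same on Python 2.7 and Python 3.3+; the variable names are mine, chosen for illustration:

```python
# GBK-encoded bytes for the two characters in the question.
gbk_bytes = b'\xd6\xd0\xce\xc4'

# Decoding with the codec that produced the bytes gives the intended text.
correct = gbk_bytes.decode('gbk')

# Decoding as Latin-1 also "succeeds", because every byte value maps to a
# code point, but the result is four unrelated characters.
wrong = gbk_bytes.decode('latin1')

assert correct == u'\u4e2d\u6587'
assert wrong == u'\xd6\xd0\xce\xc4'
assert correct != wrong

# Latin-1 maps bytes 0-255 one-to-one, so in this case the damage is
# reversible: re-encode as Latin-1, then decode with the right codec.
repaired = wrong.encode('latin1').decode('gbk')
assert repaired == correct
```

The repair step only works because Latin-1 never loses a byte; with most other wrong codecs the original bytes cannot be recovered this way.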

Note also that pasting non-ASCII characters only works because the Python interpreter has detected my terminal codec correctly. I can paste text from my browser into my terminal, which then passes the text to Python as UTF-8-encoded data. Because Python has asked the terminal what codec it uses, it can decode those bytes correctly when building the value of a u'....' Unicode literal. When printing a unicode result, Python once more auto-encodes the data to fit my terminal encoding.

To see what codec Python detected, print sys.stdin.encoding:

>>> import sys
>>> sys.stdin.encoding
'UTF-8'
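A stream does not always report an encoding (under Python 2 it is typically None when input or output is piped), so a script may want a fallback before encoding text itself. This is only a sketch: the helper name stream_encoding and the 'ascii' fallback are my own assumptions, not something the interpreter prescribes.

```python
import sys


def stream_encoding(stream, fallback='ascii'):
    """Return the encoding a stream reports, or `fallback` if it reports none."""
    return getattr(stream, 'encoding', None) or fallback


stdin_codec = stream_encoding(sys.stdin)
stdout_codec = stream_encoding(sys.stdout)

text = u'\u4e2d\u6587'

# Encode explicitly for the output stream. The 'replace' error handler
# substitutes '?' for characters the codec cannot represent instead of
# raising UnicodeEncodeError.
encoded = text.encode(stdout_codec, 'replace')
```

Encoding explicitly like this makes the codec choice visible in the code rather than leaving it to whatever print decides at run time.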

Similar decisions have to be made when dealing with different sources of text. Reading string literals from the source file, for example, requires that you either use ASCII only (and use escape codes for everything else), or provide Python with an explicit codec notation at the top of the file.
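For source files, that explicit notation is the encoding declaration described in PEP 263: a special comment on the first or second line of the file. A small sketch, assuming the file really is saved as UTF-8 (the declaration only tells Python how to read the bytes; it does not convert the file):

```python
# -*- coding: utf-8 -*-
# The declaration above only takes effect on line 1 or 2 of the file.

literal = u'中文'            # non-ASCII literal, decoded using the declared codec
escaped = u'\u4e2d\u6587'    # ASCII-only escapes, independent of the file encoding

assert literal == escaped
```

If the declared codec does not match how the editor saved the file, the literal form silently becomes different text, while the escaped form stays correct.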

I urge you to read:

  • The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky

  • The Python Unicode HOWTO

  • Pragmatic Unicode by Ned Batchelder

to gain a more complete understanding of how Unicode works, and how Python handles Unicode.

Martijn Pieters answered Sep 30 '22 19:09