Is u'string' the same as 'string'.decode('XXX')

Although the title is phrased as a question, the short answer is apparently no; I tried it in the shell. The real question is: why? P.S. string here is non-ASCII text such as Chinese, and XXX is the encoding the string is currently in.

>>> u'中文' == '中文'.decode('gbk')
False
# The first one is u'\xd6\xd0\xce\xc4', while the second one is u'\u4e2d\u6587'

The example is above. I am using Windows, Simplified Chinese edition. The default encoding is GBK, and the Python shell uses it as well. Yet the two unicode objects compare unequal.

UPDATES

>>> a = '中文'.decode('gbk')
>>> a
u'\u4e2d\u6587'
>>> print a
中文

>>> b = u'中文'
>>> print b
ÖÐÎÄ

asked Jan 07 '14 14:01

Joey.Z

1 Answer

Yes, str.decode() usually returns a unicode string if the codec can decode the bytes successfully. But the result only represents the same text as the u'...' literal if the correct codec is used.

Your u'中文' literal was not decoded with the right codec; it holds GBK-encoded bytes that were decoded as Latin-1:

>>> print u'\u4e2d\u6587'
中文
>>> u'\u4e2d\u6587'.encode('gbk')
'\xd6\xd0\xce\xc4'
>>> u'\u4e2d\u6587'.encode('gbk').decode('latin1')
u'\xd6\xd0\xce\xc4'

The values are indeed not equal, because they are not the same text.
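The same comparison can be sketched in Python 3, where bytes plays the role of Python 2's str and str plays the role of unicode. This is only an illustration under that assumption, and the variable names are made up for the example; the text is written with escapes so the snippet does not depend on any terminal or source encoding.

```python
# Python 3 sketch: bytes ~ Python 2 str, str ~ Python 2 unicode.
text = '\u4e2d\u6587'                # the two Chinese characters, as escapes

gbk_bytes = text.encode('gbk')       # the bytes a GBK console would send

right = gbk_bytes.decode('gbk')      # correct codec: the original text again
wrong = gbk_bytes.decode('latin1')   # wrong codec: one character per byte

print(gbk_bytes)                     # b'\xd6\xd0\xce\xc4'
print(right == text)                 # True
print(wrong == text)                 # False
print(ascii(wrong))                  # '\xd6\xd0\xce\xc4'
print(len(right), len(wrong))        # 2 4
```

The wrong decode yields four code points, one per byte, rather than the two characters that were typed.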

Again, it is important that you use the right codec; a different codec produces a very different result:

>>> print u'\u4e2d\u6587'.encode('gbk').decode('latin1')
ÖÐÎÄ

Here the GBK-encoded bytes were decoded as Latin-1, not as GBK. Decoding succeeded, because Latin-1 assigns a character to every byte value, but the resulting text is not readable.
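If the damage has already happened, it can sometimes be reversed. The sketch below is written for Python 3, and repair_mojibake is a hypothetical helper name, not a library function. It assumes the wrong codec was Latin-1, which maps each byte to the code point of the same value, so re-encoding with it recovers the original bytes exactly. Other wrong codecs may lose information and not round-trip.

```python
def repair_mojibake(garbled, actual_codec, wrong_codec='latin1'):
    """Re-encode with the codec that was wrongly used, then decode properly.

    Hypothetical helper: only lossless when wrong_codec maps every byte
    to a distinct character, as Latin-1 does.
    """
    return garbled.encode(wrong_codec).decode(actual_codec)


garbled = '\xd6\xd0\xce\xc4'          # GBK bytes that were decoded as Latin-1
fixed = repair_mojibake(garbled, 'gbk')

print(fixed == '\u4e2d\u6587')        # True
```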

Note also that pasting non-ASCII characters only works because the Python interpreter has detected my terminal codec correctly. I can paste text from my browser into my terminal, which then passes the text to Python as UTF-8-encoded data. Because Python has asked the terminal what codec it uses, it is able to decode those bytes correctly when building the value of a u'....' Unicode literal. When printing a unicode value, Python once more auto-encodes the data to fit my terminal encoding. In your shell, the literal appears to have been decoded as Latin-1 rather than GBK, which is why u'中文' came out as u'\xd6\xd0\xce\xc4'.

To see what codec Python detected, print sys.stdin.encoding:

>>> import sys
>>> sys.stdin.encoding
'UTF-8'
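A slightly broader check, as a Python 3 sketch: describe_encodings is a hypothetical helper that gathers the encodings Python reports for the standard streams and the locale. The values depend entirely on the environment, so no particular output is assumed, and a stream may be missing or report no encoding at all.

```python
import locale
import sys


def describe_encodings():
    """Collect the encodings Python reports for this process (hypothetical helper)."""
    streams = {'stdin': sys.stdin, 'stdout': sys.stdout}
    # A stream can be None or lack an encoding, e.g. when redirected or detached.
    report = {name: getattr(stream, 'encoding', None)
              for name, stream in streams.items()}
    report['preferred'] = locale.getpreferredencoding(False)
    report['default'] = sys.getdefaultencoding()
    return report


for name, encoding in sorted(describe_encodings().items()):
    print(name, encoding)
```

If stdin reports something other than what the console actually sends, non-ASCII text typed into a Unicode literal can end up decoded with the wrong codec.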

Similar decisions have to be made when dealing with different sources of text. Reading string literals from the source file, for example, requires that you either use ASCII only (and use escape codes for everything else), or provide Python with an explicit codec notation at the top of the file.
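Both options for source files can be sketched as follows. The snippet assumes the file is saved as UTF-8, and the declaration comment only takes effect when it sits on the first or second line of the file. Python 3 already assumes UTF-8 for source files; under Python 2, a non-ASCII literal without such a declaration is rejected as a syntax error.

```python
# -*- coding: utf-8 -*-
# Codec declaration (PEP 263). It must match the encoding the editor
# actually used to save the file; UTF-8 is assumed here.

literal = u'中文'             # relies on the declared source encoding
escaped = u'\u4e2d\u6587'     # ASCII-only, independent of the source encoding

print(literal == escaped)     # True when the file really is saved as UTF-8
```

The escaped form is the safer choice when you cannot control how a file will be saved or transferred.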

I urge you to read:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
The Python Unicode HOWTO
Pragmatic Unicode by Ned Batchelder

to gain a more complete understanding of how Unicode works, and how Python handles Unicode.

answered Sep 30 '22 19:09

Martijn Pieters

Tags: python, unicode, decode
