Unicode and `decode()` in Python

>>> a = "我"  # a Chinese character
>>> b = unicode(a, "gb2312")
>>> a.__class__
<type 'str'>
>>> b.__class__
<type 'unicode'>  # b is unicode
>>> a
'\xce\xd2'
>>> b
u'\u6211' 

>>> c = u"我"
>>> c.__class__
<type 'unicode'>  # c is unicode
>>> c
u'\xce\xd2'

Both b and c are unicode objects, but b displays as u'\u6211' while c displays as u'\xce\xd2'. Why?

asked Apr 23 '12 08:04

Tanky Woo

1 Answer

When you enter "我", the terminal hands the Python 2 interpreter that character encoded in your local character set. Because plain quotes create a byte string (str), the interpreter stores those bytes unchanged. On my UTF-8 system, that's '\xe6\x88\x91'. On yours, it's '\xce\xd2' because you use GB2312. That explains the value of your variable a.
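To see the two byte sequences side by side without depending on any particular terminal, here is a minimal sketch. It writes the character as the escape \u6211 so that no source or terminal encoding is involved; the variable names are only illustrative.

```python
# The character 我, written as an escape so the result does not depend
# on how this file or the terminal is encoded.
ch = u"\u6211"

# Encode the same character in the two character sets discussed above.
gb_bytes = ch.encode("gb2312")
utf8_bytes = ch.encode("utf-8")

print(repr(gb_bytes))    # two bytes under GB2312
print(repr(utf8_bytes))  # three bytes under UTF-8
```

The same character becomes different bytes under each encoding, which is why the value of a depends on the terminal's character set.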

When you enter u"我", the Python interpreter doesn't know which encoding the 我 character is in. It handles the input much as it would an ordinary string: it copies the bytes into a Unicode string and treats each byte as a separate Unicode code point. Hence the wrong result u'\xce\xd2' (or, on my box, u'\xe6\x88\x91'). Your b comes out right because unicode(a, "gb2312") names the encoding explicitly, so the two bytes are decoded into the single code point U+6211.
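The effect can be reproduced outside the interactive prompt. The sketch below uses the latin-1 codec as a stand-in for "each byte becomes one code point", since latin-1 maps bytes 0-255 directly onto U+0000 to U+00FF. That is a modelling assumption for illustration, not a claim about what the interpreter does internally.

```python
# The two bytes a GB2312 terminal sends for 我.
raw = b"\xce\xd2"

# Assumption: latin-1 models "one byte = one code point", giving the
# two-character string seen in c.
wrong = raw.decode("latin-1")

# Naming the real encoding gives the single character seen in b.
right = raw.decode("gb2312")

# A string mangled this way can be repaired by reversing the wrong step
# and then decoding with the correct codec.
repaired = wrong.encode("latin-1").decode("gb2312")

print(repr(wrong), repr(right), repaired == right)
```

The round trip through latin-1 only works because that codec is lossless for every byte value; it would not be a safe repair for text mangled in other ways.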

This problem only exists in the interactive interpreter. When you write Python scripts or modules, you can declare the source encoding in a comment on the first or second line, and Unicode string literals will come out right. E.g., on my system, the following prints the word liberté twice:

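#!/usr/bin/env python
# -*- coding: utf-8 -*-
# The coding line above tells Python 2 how to decode the literals in this file.

print(u"liberté")  # unicode literal, decoded using the declared encoding
print("liberté")   # byte string, written to the terminal as-is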

answered Oct 12 '22 12:10

Fred Foo

Tags: python, unicode, decode, codec
