
Unicode and `decode()` in Python

>>> a = "我"  # chinese  
>>> b = unicode(a,"gb2312")  
>>> a.__class__   
<type 'str'>   
>>> b.__class__   
<type 'unicode'>  # b is unicode
>>> a
'\xce\xd2'
>>> b
u'\u6211' 

>>> c = u"我"
>>> c.__class__
<type 'unicode'>  # c is unicode
>>> c
u'\xce\xd2'

b and c are both unicode objects, but evaluating b shows u'\u6211' while evaluating c shows u'\xce\xd2'. Why?

Asked Apr 23 '12 08:04 by Tanky Woo



1 Answer

When you enter "我", the Python interpreter receives from the terminal the bytes that represent that character in your local character set. Because plain "" quotes create a byte string (str), it stores those bytes unchanged. On my UTF-8 system, that's '\xe6\x88\x91'. On yours, it's '\xce\xd2' because your terminal uses GB2312. That explains the value of your variable a.
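To make the byte values concrete, here is a minimal sketch that encodes the same character under both character sets. It is written with an escape (u"\u6211") so the result does not depend on the terminal; the variable names are illustrative only, and the snippet is assumed to run on Python 2.6+ or Python 3.

```python
# The character from the question, written as an escape so that the
# terminal's encoding plays no role in what Python stores.
char = u"\u6211"

# Encoding turns the single code point into encoding-specific bytes.
gb2312_bytes = char.encode("gb2312")
utf8_bytes = char.encode("utf-8")

print(gb2312_bytes == b"\xce\xd2")      # True: what a GB2312 terminal sends
print(utf8_bytes == b"\xe6\x88\x91")    # True: what a UTF-8 terminal sends

# Decoding with the matching codec recovers the original character,
# which is what unicode(a, "gb2312") does in the question.
round_trip = gb2312_bytes.decode("gb2312")
print(round_trip == char)               # True
```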

When you enter u"我", the Python interpreter doesn't know which encoding the character is in. What it does is pretty much the same as for an ordinary string: it takes the bytes of the character and stores them in a Unicode string, treating each byte as a separate Unicode code point (in effect, decoding them as Latin-1). Hence the wrong result u'\xce\xd2' (or, on my box, u'\xe6\x88\x91').
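The byte-per-code-point behaviour described above can be reproduced explicitly, since mapping each byte to the code point with the same number is exactly what the Latin-1 codec does. The sketch below is an illustration under that assumption, not a trace of the interpreter's internals; the names raw, mangled and repaired are hypothetical.

```python
# Bytes a GB2312 terminal sends for the character in the question.
raw = b"\xce\xd2"

# Treat every byte as its own code point (Latin-1 decoding).
mangled = raw.decode("latin-1")
print(mangled == u"\xce\xd2")   # True: two code points, like variable c
print(len(mangled))             # 2

# Because Latin-1 is a one-to-one byte mapping, the damage is reversible:
# go back to the original bytes, then decode with the right codec.
repaired = mangled.encode("latin-1").decode("gb2312")
print(repaired == u"\u6211")    # True: one code point, like variable b
print(len(repaired))            # 1
```

The same round trip can rescue an already mangled value such as c, provided you know which encoding the terminal actually used.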

This problem only exists in the interactive interpreter. When you write Python scripts or modules, you can declare the source encoding in a comment on the first or second line, and Unicode string literals will come out right. E.g., on my system, the following prints the word liberté twice:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

print(u"liberté")
print("liberté")
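The effect of the coding declaration can also be observed without saving a file, by compiling source code held in a byte string. This is only a sketch: the "<demo>" filename and the variable names are made up for illustration, and it assumes Python 2.6+ or Python 3.3+ so that both the b"" and u"" prefixes are accepted.

```python
# Source code as raw bytes: the unicode literal contains the two GB2312
# bytes for the character, exactly as a GB2312 editor would save them.
source = b'# -*- coding: gb2312 -*-\ns = u"\xce\xd2"\n'

namespace = {}
exec(compile(source, "<demo>", "exec"), namespace)

# With the correct declaration the literal is decoded as one character.
print(namespace["s"] == u"\u6211")      # True

# Declaring Latin-1 instead reproduces the mangled value from the question.
wrong_source = source.replace(b"gb2312", b"latin-1")
wrong_namespace = {}
exec(compile(wrong_source, "<demo>", "exec"), wrong_namespace)
print(wrong_namespace["s"] == u"\xce\xd2")  # True
```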

Answered Oct 12 '22 12:10 by Fred Foo