Python string to unicode [duplicate]

Possible Duplicate:
How do I treat an ASCII string as unicode and unescape the escaped characters in it in python?
How do convert unicode escape sequences to unicode characters in a python string

I have a string that contains unicode escape sequences, e.g. \u2026. Somehow it reaches me as a plain str rather than as unicode. How do I convert it back to unicode?

>>> a="Hello\u2026"
>>> b=u"Hello\u2026"
>>> print a
Hello\u2026
>>> print b
Hello…
>>> print unicode(a)
Hello\u2026

So clearly unicode(a) is not the answer. Then what is?

asked Apr 22 '12 13:04 by prongs


2 Answers

In Python 2, \u escapes are only interpreted in unicode literals (u"..."), so this

 a="\u2026" 

is actually a string of 6 characters: '\', 'u', '2', '0', '2', '6'.

To make unicode out of this, use decode('unicode-escape'):

a="\u2026"
print repr(a)
print repr(a.decode('unicode-escape'))

## '\\u2026'
## u'\u2026'
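As a rough illustration of the six-character point, here is a minimal sketch that should behave the same on Python 2 and Python 3. It uses a bytes literal with an escaped backslash so the content is unambiguous on both; the names `raw` and `decoded` are only illustrative and are not part of the answer above.

```python
# Sketch only: `raw` and `decoded` are illustrative names.
# The literal below holds six ASCII characters: \ u 2 0 2 6.
raw = b"\\u2026"

# The unicode_escape codec interprets the escape and yields one character.
decoded = raw.decode("unicode_escape")

print(len(raw))      # 6
print(len(decoded))  # 1
print(repr(decoded))
```

The length drops from six to one because the codec replaces the whole escape sequence with the single code point U+2026.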
answered Oct 16 '22 10:10 by georg


Decode it with the unicode-escape codec:

>>> a="Hello\u2026"
>>> a.decode('unicode-escape')
u'Hello\u2026'
>>> print _
Hello…

This is because in a non-unicode string literal the \u2026 escape is not recognised; it is kept as a literal series of characters (repr shows this more clearly as 'Hello\\u2026'). You need to decode the escapes, and the unicode-escape codec can do that for you.

Note that you can get unicode() to recognise the escapes in the same way by passing the codec as its encoding argument:

>>> unicode(a, 'unicode-escape')
u'Hello\u2026'

But the a.decode() way is nicer.
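The sessions in this answer are Python 2. On Python 3, str has no decode() method, so the same idea needs a detour through bytes. The following is only a sketch under that assumption: `unescape_unicode` is a hypothetical helper name, and it assumes the input is pure ASCII apart from its escapes, as in the question.

```python
def unescape_unicode(text):
    """Hypothetical helper: turn literal \\uXXXX sequences in `text` into characters.

    Assumes `text` is ASCII apart from its escapes; non-ASCII input raises
    UnicodeEncodeError here instead of being silently mangled.
    """
    return text.encode("ascii").decode("unicode_escape")


# Backslash, 'u', '2', '0', '2', '6' after "Hello": not yet an ellipsis.
a = "Hello\\u2026"
b = unescape_unicode(a)

print(len(a), len(b))       # 11 6
print(b == "Hello\u2026")   # True
```

Encoding to ASCII first is a deliberately strict choice: it fails loudly on input that already contains real non-ASCII characters, which the unicode_escape codec would otherwise read as Latin-1.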

answered Oct 16 '22 09:10 by Chris Morgan