Possible Duplicate:
How do I treat an ASCII string as unicode and unescape the escaped characters in it in python?
How do convert unicode escape sequences to unicode characters in a python string
I have a string that contains Unicode escape sequences, e.g. \u2026. Somehow it does not reach me as unicode, but as a str. How do I convert it back to unicode?
    >>> a="Hello\u2026"
    >>> b=u"Hello\u2026"
    >>> print a
    Hello\u2026
    >>> print b
    Hello…
    >>> print unicode(a)
    Hello\u2026
    >>>
So clearly unicode(a) is not the answer. Then what is?
In Python 2 you have two options for creating a Unicode string from a byte string: call decode() on it, or pass it to unicode() together with an encoding such as UTF-8. The signature is unicode(string[, encoding, errors]); the string argument should be an 8-bit string.
UTF-8 is a Unicode character encoding. It takes the code point of a given Unicode character and translates it into a sequence of one to four bytes. It also does the reverse, reading bytes and converting them back to characters.
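As a rough illustration of that round trip, here is a minimal sketch in Python 3 syntax (the question itself is about Python 2, so treat the version as an assumption). It encodes the ellipsis character from the question to UTF-8 bytes and decodes it back:

```python
# Round-trip one character through UTF-8 (Python 3 syntax).
ch = "\u2026"                       # HORIZONTAL ELLIPSIS, code point U+2026
encoded = ch.encode("utf-8")        # code point -> bytes
decoded = encoded.decode("utf-8")   # bytes -> code point

print(encoded)         # b'\xe2\x80\xa6'
print(decoded == ch)   # True
```

The single character becomes three bytes, which is why the byte length and the character length of a string can differ.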
In Python 3, the built-in functions chr() and ord() convert between Unicode code points and characters (in Python 2, chr() only covers 0 to 255 and unichr() handles the rest). A character can also be written in a string literal as a hexadecimal code point with \x, \u, or \U.
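A small sketch of those conversions, again assuming Python 3 (where every str literal is a Unicode string):

```python
# ord() maps a character to its code point; chr() goes the other way.
code_point = ord("\u2026")
char = chr(0x2026)

print(code_point)        # 8230
print(hex(code_point))   # 0x2026

# The same character written with a 4-digit and an 8-digit escape.
same = "\u2026" == "\U00002026" == char
print(same)              # True

# \x takes exactly two hex digits, so it only reaches code points up to 0xFF.
print("\xe9" == "\u00e9")   # True
```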
Python 3's string type uses the Unicode Standard for representing characters, which lets Python programs work with characters from any script that Unicode covers.
Unicode escapes only work in unicode strings, so in Python 2 this
    a="\u2026"
is actually a string of 6 characters: '\', 'u', '2', '0', '2', '6'.
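If you only have Python 3 at hand, you can still see the six-character effect with a raw string. This is a sketch under that assumption: in Python 3 a plain "\u2026" literal is interpreted as an escape, so the raw string r"\u2026" stands in for the Python 2 byte string above.

```python
# A raw string leaves the backslash escape uninterpreted (Python 3).
literal = r"\u2026"
chars = list(literal)

print(len(literal))    # 6
print(chars)           # ['\\', 'u', '2', '0', '2', '6']

# Without the r prefix the escape is interpreted as a single character.
print(len("\u2026"))   # 1
```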
To make unicode out of this, use decode('unicode-escape'):
    a="\u2026"
    print repr(a)
    print repr(a.decode('unicode-escape'))
    ## '\\u2026'
    ## u'\u2026'
Decode it with the unicode-escape codec:
    >>> a="Hello\u2026"
    >>> a.decode('unicode-escape')
    u'Hello\u2026'
    >>> print _
    Hello…
This is because for a non-unicode string the \u2026 is not recognised but is instead treated as a literal series of characters (to put it more clearly, 'Hello\\u2026'). You need to decode the escapes, and the unicode-escape codec can do that for you.
Note that you can get unicode to recognise it in the same way by specifying the codec argument:
    >>> unicode(a, 'unicode-escape')
    u'Hello\u2026'
But the a.decode() way is nicer.
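For anyone reading this on Python 3, where str has no decode() method, something along these lines should give the same result. It is only a sketch: unescape_unicode is a hypothetical helper name, and it assumes the input contains nothing but ASCII characters apart from the escapes (anything else makes the encode step raise UnicodeEncodeError).

```python
def unescape_unicode(text):
    """Turn literal \\uXXXX sequences in an ASCII-only str into real characters.

    Hypothetical helper, not part of the standard library.
    """
    # str -> bytes first, because the unicode_escape codec decodes bytes.
    return text.encode("ascii").decode("unicode_escape")


a = r"Hello\u2026"   # raw string: backslash, 'u', '2', '0', '2', '6'
result = unescape_unicode(a)

print(len(a))        # 11
print(len(result))   # 6
print(result)        # Hello…
```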