
Python 2.7: How to convert unicode escapes in a string into actual utf-8 characters

I use Python 2.7 and I'm receiving a string from a server (not in unicode!). Inside that string I find text with unicode escape sequences, for example:

<a href = "http://www.mypage.com/\u0441andmoretext">\u00b2<\a>

How do I convert those \uxxxx sequences back to utf-8? The answers I found either dealt with &# entities or required eval(), which is too slow for my purposes. I need a universal solution for any text containing such sequences.

Edit: <\a> is a typo, but I want tolerance for such typos as well. Only \u sequences should be converted.

The example text, written in proper Python syntax, looks like this:

"<a href = \"http://www.mypage.com/\\u0441andmoretext\">\\u00b2<\\a>"

The desired output, in proper Python syntax:

"<a href = \"http://www.mypage.com/\xd1\x81andmoretext\">\xc2\xb2<\\a>"

evolution asked Mar 17 '23 08:03


2 Answers

Try

>>> s = "<a href = \"http://www.mypage.com/\\u0441andmoretext\">\\u00b2<\\a>"
>>> s.decode("raw_unicode_escape")
u'<a href = "http://www.mypage.com/\u0441andmoretext">\xb2<\\a>'

And then you can encode to utf8 as usual.
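
For completeness, here is a minimal sketch of the whole round trip, using the string from the question. The variable names are only illustrative. The b prefix is optional on Python 2.7 (where it is the same as a plain str literal); it is used here so the snippet also runs unchanged on Python 3:

```python
# Byte string as received from the server: it contains literal backslash-u
# escape sequences rather than real non-ASCII characters.
raw = b'<a href = "http://www.mypage.com/\\u0441andmoretext">\\u00b2<\\a>'

# Step 1: turn the \uXXXX sequences into real characters (a unicode object).
text = raw.decode("raw_unicode_escape")

# Step 2: encode the unicode object to UTF-8 bytes.
utf8_bytes = text.encode("utf-8")

# The stray "<\a>" is left alone; only the \u sequences were converted.
assert utf8_bytes == b'<a href = "http://www.mypage.com/\xd1\x81andmoretext">\xc2\xb2<\\a>'
```

This matches the desired output given in the question.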

Ella Sharakanski answered Mar 19 '23 04:03


Python does contain some special string codecs for cases like this.

In this case, if there are no characters outside the 32-127 range, you can decode your byte string using the "unicode_escape" codec to get a proper Unicode text object in Python, which is what your program should perform all textual operations on. Whenever you output that text again, encode it to utf-8 as usual:

rawtext = r"""<a href="http://www.mypage.com/\u0441andmoretext">\u00b2<\a>"""
text = rawtext.decode("unicode_escape")
# Text operations go here
...
output_text = text.encode("utf-8")
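
One caveat that may matter for the question: "unicode_escape" interprets every Python-style backslash escape, not only \uXXXX, so the stray <\a> from the question turns into a BEL character (0x07). "raw_unicode_escape" only touches \uXXXX and \UXXXXXXXX. A small sketch comparing the two (the sample bytes are made up for illustration):

```python
# Made-up sample: a stray "\a" followed by one \uXXXX escape.
raw = b'<\\a>\\u00b2'

# "unicode_escape" treats "\a" as an escape too and yields U+0007 (BEL).
interpreted = raw.decode("unicode_escape")

# "raw_unicode_escape" leaves "\a" as a backslash followed by "a".
untouched = raw.decode("raw_unicode_escape")

assert interpreted == u'<\x07>\xb2'
assert untouched == u'<\\a>\xb2'
```

If only \u sequences should be converted, "raw_unicode_escape" is probably the closer fit.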

If there are other bytes outside the 32-127 range, the unicode_escape codec assumes them to be in the latin1 encoding. So if your response mixes utf-8 and these \uXXXX sequences, you have to:

  1. decode the original string using utf-8
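  2. encode back to latin1 (this raises UnicodeEncodeError for characters outside latin1 unless you pass an error handler such as "backslashreplace")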
  3. decode using "unicode_escape"
  4. work on the text
  5. encode back to utf-8
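
The steps above can be sketched roughly as follows. decode_mixed is a hypothetical helper name, and the "backslashreplace" error handler is an addition of this sketch, not part of the steps as listed: it keeps step 2 from failing when the UTF-8 text contains a character that does not fit in latin1. The sample input is made up.

```python
def decode_mixed(raw_bytes):
    # 1. Decode the original bytes as UTF-8.
    text = raw_bytes.decode("utf-8")
    # 2. Encode back to latin1. Characters that do not fit are written out
    #    as backslash escapes (assumption: "backslashreplace" is acceptable).
    latin1 = text.encode("latin-1", "backslashreplace")
    # 3. Decode with "unicode_escape", which reads bytes above 127 as latin1.
    return latin1.decode("unicode_escape")


# Made-up sample: a UTF-8 e-acute, an escaped Cyrillic letter, and the same
# Cyrillic letter as real UTF-8 bytes.
raw = b'caf\xc3\xa9 \\u0441 \xd1\x81'

text = decode_mixed(raw)        # 4. work on the text here
output = text.encode("utf-8")   # 5. encode back to UTF-8

assert output == b'caf\xc3\xa9 \xd1\x81 \xd1\x81'
```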

jsbueno answered Mar 19 '23 03:03