
How to unquote a urlencoded unicode string in python?

I have a unicode string like "Tanım" which has somehow been encoded as "Tan%u0131m". How can I convert this encoded string back to the original unicode? Apparently urllib.unquote does not support unicode.

hamdiakoguz asked Nov 18 '08 22:11




2 Answers

%uXXXX is a non-standard encoding scheme that has been rejected by the W3C, although an implementation lives on in JavaScript land.

The more common technique is to UTF-8 encode the string and then %-escape the resulting bytes as %XX. This scheme is supported by urllib.unquote:

>>> import urllib
>>> urllib.unquote("%0a")
'\n'

Unfortunately, if you really need to support %uXXXX, you will probably have to roll your own decoder. Otherwise, it is preferable to simply UTF-8 encode your unicode string and then %-escape the resulting bytes.

A more complete example:

>>> import urllib
>>> u"Tanım"
u'Tan\u0131m'
>>> url = urllib.quote(u"Tanım".encode('utf8'))
>>> urllib.unquote(url).decode('utf8')
u'Tan\u0131m'
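The session above is Python 2. For readers on Python 3, here is a minimal sketch of the same UTF-8 round trip, assuming the functions now live in urllib.parse and UTF-8 is used as the encoding (it is the default for both calls). The variable names are only illustrative.

```python
from urllib.parse import quote, unquote

# "Tan\u0131m" is "Tanım"; written with an escape so the file stays ASCII.
original = "Tan\u0131m"

# quote() encodes the text as UTF-8 and %-escapes each non-ASCII byte.
encoded = quote(original)

# unquote() reverses the escaping and decodes the bytes as UTF-8.
decoded = unquote(encoded)

print(encoded)              # Tan%C4%B1m
print(decoded == original)  # True
```

Note that this handles only the standard %XX form; a %uXXXX escape passes through unquote() unchanged.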
Aaron Maenpaa answered Sep 22 '22 21:09


import re

def unquote(text):
    # Replace each %uXXXX escape with the unicode character it names (Python 2).
    def unicode_unquoter(match):
        return unichr(int(match.group(1), 16))
    return re.sub(r'%u([0-9a-fA-F]{4})', unicode_unquoter, text)
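A possible Python 3 adaptation of the function above, as a sketch rather than a drop-in replacement: unichr no longer exists there, so chr is used instead. The name unquote_percent_u is made up for this example to avoid shadowing urllib.parse.unquote.

```python
import re

# Matches the non-standard %uXXXX form: "%u" followed by exactly four hex digits.
PERCENT_U = re.compile(r"%u([0-9a-fA-F]{4})")

def unquote_percent_u(text):
    """Replace each %uXXXX escape with the code point it names.

    Standard %XX escapes are left untouched.
    """
    return PERCENT_U.sub(lambda match: chr(int(match.group(1), 16)), text)

decoded = unquote_percent_u("Tan%u0131m")
print(decoded == "Tan\u0131m")  # True
```

One limitation to keep in mind: a character outside the Basic Multilingual Plane, which JavaScript's escape() writes as two %uXXXX surrogate escapes, would come out of this sketch as two separate surrogate code points rather than one character.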
Markus Jarderot answered Sep 22 '22 21:09