I have a unicode string like "Tanım" which is encoded as "Tan%u0131m" somehow. How can i convert this encoded string back to original unicode. Apparently urllib.unquote does not support unicode.
In python, to remove Unicode character from string python we need to encode the string by using str. encode() for removing the Unicode characters from the string.
Remarks. If encoding and/or errors are given, unicode() will decode the object which can either be an 8-bit string or a character buffer using the codec for encoding. The encoding parameter is a string giving the name of an encoding; if the encoding is not known, LookupError is raised.
UTF-8 is a byte oriented encoding. The encoding specifies that each character is represented by a specific sequence of one or more bytes.
%uXXXX is a non-standard encoding scheme that has been rejected by the w3c, despite the fact that an implementation continues to live on in JavaScript land.
The more common technique seems to be to UTF-8 encode the string and then % escape the resulting bytes using %XX. This scheme is supported by urllib.unquote:
>>> urllib2.unquote("%0a") '\n'
Unfortunately, if you really need to support %uXXXX, you will probably have to roll your own decoder. Otherwise, it is likely to be far more preferable to simply UTF-8 encode your unicode and then % escape the resulting bytes.
A more complete example:
>>> u"Tanım" u'Tan\u0131m' >>> url = urllib.quote(u"Tanım".encode('utf8')) >>> urllib.unquote(url).decode('utf8') u'Tan\u0131m'
def unquote(text): def unicode_unquoter(match): return unichr(int(match.group(1),16)) return re.sub(r'%u([0-9a-fA-F]{4})',unicode_unquoter,text)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With