I have a list of URLs that contain escaped characters. The escapes were introduced by urllib2.urlopen
when it retrieved the HTML page:
http://www.sample1webpage.com/index.php?title=%E9%A6%96%E9%A1%B5&action=edit
http://www.sample1webpage.com/index.php?title=%E9%A6%96%E9%A1%B5&action=history
http://www.sample1webpage.com/index.php?title=%E9%A6%96%E9%A1%B5&variant=zh
Is there a way to transform them back to their unescaped form in Python?
P.S.: The URLs are encoded in UTF-8.
"URL encoding converts characters into a format that can be transmitted over the Internet." (W3Schools) So "/" is a separator, but "%2f" is an ordinary character that simply represents a literal "/" inside one element of your URL.
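A minimal Python 3 sketch of the separator point above: a slash that belongs to the data is escaped as %2F, so splitting the path on "/" leaves it inside its segment. The segment value "a/b" and the "/files/" prefix are made-up examples, not taken from the question.

```python
from urllib.parse import quote, unquote

# Hypothetical path segment that itself contains a slash.
segment = "a/b"

# safe="" makes quote() escape "/" as well (by default "/" is left alone).
encoded = quote(segment, safe="")

# Build a path; only the unescaped slashes act as separators.
path = "/files/" + encoded
parts = path.split("/")

# Decode each segment only after splitting on the real separators.
decoded_parts = [unquote(p) for p in parts]

print(encoded)
print(parts)
print(decoded_parts)
```

Decoding before splitting would turn the escaped slash back into a separator, which is why the order matters here.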
In .NET, you can use HttpUtility.UrlDecode to decode %20 and any other encoded characters. It does not remove the %20; it replaces it with a space, because that is what it represents. If you want it removed completely, you have to use Replace instead.
Using the urllib package (import urllib):
From the official Python 2 documentation:
urllib.unquote(string)
Replace %xx escapes by their single-character equivalent.
Example: unquote('/%7Econnolly/') yields '/~connolly/'.
From the official Python 3 documentation:
urllib.parse.unquote(string, encoding='utf-8', errors='replace')
[…]
Example: unquote('/El%20Ni%C3%B1o/') yields '/El Niño/'.
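Applied to the URLs from the question, a short Python 3 sketch could look like this. It assumes the escapes are UTF-8, as the question states; the variable names are only illustrative.

```python
from urllib.parse import unquote

urls = [
    "http://www.sample1webpage.com/index.php?title=%E9%A6%96%E9%A1%B5&action=edit",
    "http://www.sample1webpage.com/index.php?title=%E9%A6%96%E9%A1%B5&action=history",
    "http://www.sample1webpage.com/index.php?title=%E9%A6%96%E9%A1%B5&variant=zh",
]

# unquote() decodes %xx escapes as UTF-8 by default in Python 3.
decoded = [unquote(u) for u in urls]

for u in decoded:
    print(u)
```

On Python 2, urllib.unquote(u) returns a byte string, so something like urllib.unquote(u).decode('utf-8') would be needed to get a unicode string.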