
Decode escaped characters in URL

I have a list of URLs that contain escaped characters. Those characters were set by urllib2.urlopen when it retrieved the HTML page:

http://www.sample1webpage.com/index.php?title=%E9%A6%96%E9%A1%B5&action=edit
http://www.sample1webpage.com/index.php?title=%E9%A6%96%E9%A1%B5&action=history
http://www.sample1webpage.com/index.php?title=%E9%A6%96%E9%A1%B5&variant=zh

Is there a way to transform them back to their unescaped form in Python?

P.S.: The URLs are encoded in UTF-8.

Tony asked Nov 15 '11 13:11



1 Answer

Use the urllib package (import urllib):

Python 2.7

From the official documentation:

urllib.unquote(string)

Replace %xx escapes by their single-character equivalent.

Example: unquote('/%7Econnolly/') yields '/~connolly/'.
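
For the URLs in the question, a minimal sketch might look like the following. It assumes the input is an ordinary byte string, in which case Python 2's unquote() returns the raw UTF-8 bytes, so an explicit decode step is needed to get readable text. The import fallback and the variable names raw and text are only there for illustration, so the snippet also runs on Python 3.

```python
# Python 2 exposes unquote() directly on urllib; the fallback is an
# illustrative addition so this sketch also runs on Python 3.
try:
    from urllib import unquote          # Python 2.7
except ImportError:
    from urllib.parse import unquote    # Python 3

url = 'http://www.sample1webpage.com/index.php?title=%E9%A6%96%E9%A1%B5&action=edit'

# Replace each %xx escape with the character (Python 3) or byte (Python 2) it stands for.
raw = unquote(url)

# On Python 2 the result is a byte string holding UTF-8 data, so decode it
# to get unicode text. On Python 3 unquote() already returns text.
text = raw.decode('utf-8') if isinstance(raw, bytes) else raw
```

After this, text holds the URL with the title parameter as the two original Chinese characters instead of the %xx sequences.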

Python 3

From the official documentation:

urllib.parse.unquote(string, encoding='utf-8', errors='replace')

[…]

Example: unquote('/El%20Ni%C3%B1o/') yields '/El Niño/'.
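
Applied to the list from the question, a short Python 3 sketch could look like this. The names urls and decoded are just illustrative; since the URLs are UTF-8 encoded, the default encoding='utf-8' should be enough and no extra arguments are needed.

```python
from urllib.parse import unquote

# The three escaped URLs from the question.
urls = [
    "http://www.sample1webpage.com/index.php?title=%E9%A6%96%E9%A1%B5&action=edit",
    "http://www.sample1webpage.com/index.php?title=%E9%A6%96%E9%A1%B5&action=history",
    "http://www.sample1webpage.com/index.php?title=%E9%A6%96%E9%A1%B5&variant=zh",
]

# unquote() decodes %xx escapes as UTF-8 by default and returns str.
decoded = [unquote(u) for u in urls]

for u in decoded:
    print(u)
# First line printed:
# http://www.sample1webpage.com/index.php?title=首页&action=edit
```

If the data might contain invalid byte sequences, the default errors='replace' substitutes a placeholder character rather than raising an exception.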

Ignacio Vazquez-Abrams answered Nov 15 '22 13:11