I am searching for a library function to normalize a URL in Python, that is to remove "./" or "../" parts in the path, or add a default port or escape special characters and so on. The result should be a string that is unique for two URLs pointing to the same web page. For example http://google.com
and http://google.com:80/a/../
shall return the same result.
I would prefer Python 3 and already looked through the urllib
module. It offers functions to split URLs but nothing to canonicalize them. Java has the URI.normalize()
function that does a similar thing (though it does not consider the default port 80 equal to no given port), but is there something like this is python?
This is what I use and it's worked so far. You can get urlnorm from pip.
Notice that I sort the query parameters. I've found this to be essential.
from urlparse import urlsplit, urlunsplit, parse_qsl
from urllib import urlencode
import urlnorm
def canonizeurl(url):
split = urlsplit(urlnorm.norm(url))
path = split[2].split(' ')[0]
while path.startswith('/..'):
path = path[3:]
while path.endswith('%20'):
path = path[:-3]
qs = urlencode(sorted(parse_qsl(split.query)))
return urlunsplit((split.scheme, split.netloc, path, qs, ''))
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With