Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Canonicalize / normalize a URL?

I am searching for a library function to normalize a URL in Python, that is to remove "./" or "../" parts in the path, or add a default port or escape special characters and so on. The result should be a string that is unique for two URLs pointing to the same web page. For example http://google.com and http://google.com:80/a/../ shall return the same result.

I would prefer Python 3 and already looked through the urllib module. It offers functions to split URLs but nothing to canonicalize them. Java has the URI.normalize() function that does a similar thing (though it does not consider the default port 80 equal to no given port), but is there something like this is python?

like image 344
XZS Avatar asked May 14 '12 14:05

XZS


1 Answers

This is what I use and it's worked so far. You can get urlnorm from pip.

Notice that I sort the query parameters. I've found this to be essential.

from urlparse import urlsplit, urlunsplit, parse_qsl
from urllib import urlencode
import urlnorm

def canonizeurl(url):
    split = urlsplit(urlnorm.norm(url))
    path = split[2].split(' ')[0]

    while path.startswith('/..'):
        path = path[3:]

    while path.endswith('%20'):
        path = path[:-3]

    qs = urlencode(sorted(parse_qsl(split.query)))
    return urlunsplit((split.scheme, split.netloc, path, qs, ''))
like image 98
stuckintheshuck Avatar answered Oct 05 '22 18:10

stuckintheshuck