Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How can I normalize/collapse paths or URLs in Python in OS independent way?

I tried to use os.normpath in order to convert http://example.com/a/b/c/../ to http://example.com/a/b/ but it doesn't work on Windows because it does convert the slash to backslash.

like image 633
bogdan Avatar asked Jan 25 '10 09:01

bogdan


3 Answers

Here is how to do it

>>> import urlparse
>>> urlparse.urljoin("ftp://domain.com/a/b/c/d/", "../..")
'ftp://domain.com/a/b/'
>>> urlparse.urljoin("ftp://domain.com/a/b/c/d/e.txt", "../..")
'ftp://domain.com/a/b/'    

Remember that urljoin consider a path/directory all until the last / - after this is the filename, if any.

Also, do not add a leading / to the second parameter, otherwise you will not get the expected result.

os.path module is platform dependent but for file paths using only slashes but not-URLs you could use posixpath,normpath.

like image 199
sorin Avatar answered Nov 14 '22 23:11

sorin


adopted from os module " - os.path is one of the modules posixpath, or ntpath", in your case explicitly using posixpath.

   >>> import posixpath
    >>> posixpath.normpath("/a/b/../c")
    '/a/c'
    >>> 
like image 32
Dyno Fu Avatar answered Nov 14 '22 23:11

Dyno Fu


Neither urljoin nor posixpath.normpath do the job properly. urljoin forces you to join with something, and doesn't handle absolute paths or excessive ..s correctly. posixpath.normpath collapses multiple slashes and removes trailing slashes, both of which are things URLs shouldn't do.


The following function resolves URLs completely, handling both .s and ..s, in a correct way according to RFC 3986.

try:
    # Python 3
    from urllib.parse import urlsplit, urlunsplit
except ImportError:
    # Python 2
    from urlparse import urlsplit, urlunsplit

def resolve_url(url):
    parts = list(urlsplit(url))
    segments = parts[2].split('/')
    segments = [segment + '/' for segment in segments[:-1]] + [segments[-1]]
    resolved = []
    for segment in segments:
        if segment in ('../', '..'):
            if resolved[1:]:
                resolved.pop()
        elif segment not in ('./', '.'):
            resolved.append(segment)
    parts[2] = ''.join(resolved)
    return urlunsplit(parts)

You can then call it on a complete URL as following.

>>> resolve_url("http://example.com/dir/../../thing/.")
'http://example.com/thing/'

For more information on the considerations that have to be made when resolving URLs, see a similar answer I wrote earlier on the subject.

like image 25
obskyr Avatar answered Nov 14 '22 23:11

obskyr