Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

URL Normalization - Removing the dot segment

How would you go about removing the dot segments in a URL?


1 Answers

To normalize URLs by removing the dot-segment, I would use this algorithm prescribed by RFC 3986:

5.2.4. Remove Dot Segments

The pseudocode also refers to a "remove_dot_segments" routine for
interpreting and removing the special "." and ".." complete path
segments from a referenced path. This is done after the path is
extracted from a reference, whether or not the path was relative, in
order to remove any invalid or extraneous dot-segments prior to
forming the target URI. Although there are many ways to accomplish
this removal process, we describe a simple method using two string
buffers.

  1. The input buffer is initialized with the now-appended path components and the output buffer is initialized to the empty string.

  2. While the input buffer is not empty, loop as follows:

    A. If the input buffer begins with a prefix of "../" or "./", then remove that prefix from the input buffer; otherwise,

    B. if the input buffer begins with a prefix of "/./" or "/.", where "." is a complete path segment, then replace that prefix with "/" in the input buffer; otherwise,

    C. if the input buffer begins with a prefix of "/../" or "/..", where ".." is a complete path segment, then replace that prefix with "/" in the input buffer and remove the last segment and its preceding "/" (if any) from the output buffer; otherwise,

    D. if the input buffer consists only of "." or "..", then remove that from the input buffer; otherwise,

    E. move the first path segment in the input buffer to the end of the output buffer, including the initial "/" character (if any) and any subsequent characters up to, but not including, the next "/" character or the end of the input buffer.

  3. Finally, the output buffer is returned as the result of remove_dot_segments.

Python implementation:

In [36]: path = '/../a/b/../c/./d.html'

In [37]: while '/..' in path:
    pos = path.find('/..')
    pos2 = path.rfind('/',0,pos)
    if pos2 != -1:
        path = path[:pos2]+path[pos+3:]
    else:
        path = path.replace('/..','',1)
   ....:         

In [38]: path = path.replace('/./','/')

In [39]: path = path.replace('/.','')

In [40]: path
Out[40]: '/a/c/d.html'

References:

  • URL normalization - Wikipedia, the free encyclopedia
  • RFC 3986

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!