What happens if <base href...> is set with a double slash?

Tags: html, html-head

I'd like to understand how to handle a <base href="" /> value in my web crawler, so I tested several combinations in the major browsers and found a behavior with double slashes that I don't understand.

If you don't want to read everything, jump to the results of tests D and E. Demonstration of all tests:
http://gutt.it/basehref.php

Step by step my test results on calling http://example.com/images.html:

A - Multiple base href

<html>
<head>
<base target="_blank" />
<base href="http://example.com/images/" />
<base href="http://example.com/" />
</head>
<body>
<img src="/images/image.jpg">
<img src="image.jpg">
<img src="./image.jpg">
<img src="images/image.jpg"> not found
<img src="/image.jpg"> not found
<img src="../image.jpg"> not found
</body>
</html>

Conclusion

  • only the first <base> with href counts
  • a source starting with / targets the root
  • ../ goes one folder up

B - Without trailing slash

<html>
<head>
<base href="http://example.com/images" />
</head>
<body>
<img src="/images/image.jpg">
<img src="image.jpg"> not found
<img src="./image.jpg"> not found
<img src="images/image.jpg">
<img src="/image.jpg"> not found
<img src="../image.jpg"> not found
</body>
</html>

Conclusion

  • <base href> ignores everything after the last slash, so http://example.com/images becomes http://example.com/
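
The conclusion of Test B can be reproduced outside a browser. The following is a minimal sketch, assuming a runtime that provides the standard URL class (modern browsers, Node.js); the names baseB and resolvedB are only illustrative.

```javascript
// Sketch: resolve some Test B image sources against a base without a trailing slash.
// The last path segment ("images") is dropped before the relative URL is appended.
const baseB = "http://example.com/images";
const resolvedB = ["image.jpg", "./image.jpg", "images/image.jpg"].map(
  (src) => new URL(src, baseB).href
);
console.log(resolvedB);
// [ 'http://example.com/image.jpg',
//   'http://example.com/image.jpg',
//   'http://example.com/images/image.jpg' ]
```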

C - How it should be

<html>
<head>
<base href="http://example.com/" />
</head>
<body>
<img src="/images/image.jpg">
<img src="image.jpg"> not found
<img src="./image.jpg"> not found
<img src="images/image.jpg">
<img src="/image.jpg"> not found
<img src="../image.jpg"> not found
</body>
</html>

Conclusion

  • Same result as in Test B, as expected

D - Double Slash

<html>
<head>
<base href="http://example.com/images//" />
</head>
<body>
<img src="/images/image.jpg">
<img src="image.jpg">
<img src="./image.jpg">
<img src="images/image.jpg"> not found
<img src="/image.jpg"> not found
<img src="../image.jpg">
</body>
</html>

E - Double Slash with whitespace

<html>
<head>
<base href="http://example.com/images/ /" />
</head>
<body>
<img src="/images/image.jpg">
<img src="image.jpg"> not found
<img src="./image.jpg"> not found
<img src="images/image.jpg"> not found
<img src="/image.jpg"> not found
<img src="../image.jpg">
</body>
</html>

Neither is a "valid" URL, but both are real results from my web crawler. Please explain what happened in D and E so that ../image.jpg could be found, and why the whitespace makes a difference.

Just for your interest:

  • <base href="http://example.com//" /> is the same as Test C
  • <base href="http://example.com/ /" /> is completely different. Only ../image.jpg is found
  • <base href="a/" /> finds only /images/image.jpg
asked Mar 18 '15 12:03 by mgutt


1 Answer

The behavior of base is explained in the HTML spec:

"The base element allows authors to specify the document base URL for the purposes of resolving relative URLs."

As shown in your test A, if there are multiple base elements with href, the document base URL is taken from the first one.
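
For a crawler, that rule could be sketched as follows. This assumes the base elements have already been extracted from the page, in document order, as plain attribute maps; the helper documentBaseURL and that input shape are hypothetical, and error handling for an unparsable href is left out.

```javascript
// Hypothetical helper: pick the document base URL from already-extracted <base> elements.
// `baseElements` is assumed to be an array of attribute maps in document order.
function documentBaseURL(baseElements, documentURL) {
  // Only the first <base> that actually has an href attribute counts.
  const first = baseElements.find((attrs) => typeof attrs.href === "string");
  // The href itself may be relative, so it is resolved against the document URL.
  return first ? new URL(first.href, documentURL).href : documentURL;
}

// The <head> of test A: the target-only <base> is skipped, the second href is ignored.
const testABase = documentBaseURL(
  [
    { target: "_blank" },
    { href: "http://example.com/images/" },
    { href: "http://example.com/" },
  ],
  "http://example.com/images.html"
);
console.log(testABase); // http://example.com/images/
```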

Resolving relative URLs is done this way:

"Apply the URL parser to url, with base as the base URL, with encoding as the encoding."

The URL parsing algorithm is defined in the URL spec.

It's too complex to be explained here in detail. But basically, this is what happens:

  • A relative URL starting with / is resolved against the base URL's scheme and host; the base path is discarded.
  • Otherwise, the relative URL is resolved against the last directory of the base URL's path.
  • Be aware that if the base path doesn't end with /, its last part is treated as a file, not a directory, and is dropped.
  • ./ is the current directory
  • ../ goes one directory up
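
The rules above can be checked one by one with the standard URL constructor. This is only a sketch: the two base URLs and the file name x.jpg are made up for illustration.

```javascript
// Sketch of the rules above; dirBase, fileBase and x.jpg are illustrative assumptions.
const dirBase = "http://example.com/a/b/"; // ends with "/": "b" is a directory
const fileBase = "http://example.com/a/b"; // no trailing "/": "b" is a file

const ruleExamples = {
  rootRelative: new URL("/x.jpg", dirBase).href, // http://example.com/x.jpg
  dirRelative: new URL("x.jpg", dirBase).href, // http://example.com/a/b/x.jpg
  fileRelative: new URL("x.jpg", fileBase).href, // http://example.com/a/x.jpg
  currentDir: new URL("./x.jpg", dirBase).href, // http://example.com/a/b/x.jpg
  parentDir: new URL("../x.jpg", dirBase).href, // http://example.com/a/x.jpg
};
console.log(ruleExamples);
```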

(Probably, "directory" and "file" are not the proper terminology in URLs)

Some examples:

  • http://example.com/images/a/./ is http://example.com/images/a/
  • http://example.com/images/a/../ is http://example.com/images/
  • http://example.com/images//./ is http://example.com/images//
  • http://example.com/images//../ is http://example.com/images/
  • http://example.com/images/./ is http://example.com/images/
  • http://example.com/images/../ is http://example.com/
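
These examples can be verified by letting the URL constructor normalize each one. A small sketch, assuming a runtime with the standard URL class; the variable names are illustrative.

```javascript
// Sketch: normalize the dot segments of the example URLs listed above.
const dotExamples = [
  "http://example.com/images/a/./",
  "http://example.com/images/a/../",
  "http://example.com/images//./",
  "http://example.com/images//../",
  "http://example.com/images/./",
  "http://example.com/images/../",
];
const normalized = dotExamples.map((u) => new URL(u).href);
normalized.forEach((abs, i) => console.log(dotExamples[i] + " is " + abs));
```

Note that the empty segment between the two slashes counts as a directory of its own, which is why //../ only removes that empty segment.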

Note that the URL parser itself keeps // as it is; in most cases it is the server that treats // like /. As @poncha said:

"Unless you're using some kind of URL rewriting (in which case the rewriting rules may be affected by the number of slashes), the uri maps to a path on disk, but in (most?) modern operating systems (Linux/Unix, Windows), multiple path separators in a row do not have any special meaning, so /path/to/foo and /path//to////foo would eventually map to the same file."

However, in general / / won't become //: the space is kept as a path segment of its own (percent-encoded as %20), so it names a different directory.
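
Applied to tests D and E, this could be sketched as follows. The helper resolveAll is hypothetical and only wraps the standard URL constructor.

```javascript
// Hypothetical helper: resolve several sources against one base URL.
function resolveAll(base, sources) {
  return sources.map((src) => new URL(src, base).href);
}

const sourcesDE = ["image.jpg", "../image.jpg"];

// Test D: the empty segment after "images/" is the "current directory".
const doubleSlash = resolveAll("http://example.com/images//", sourcesDE);
console.log(doubleSlash);
// [ 'http://example.com/images//image.jpg',   found only if the server collapses //
//   'http://example.com/images/image.jpg' ]   ../ removes the empty segment

// Test E: the space becomes a real directory named %20.
const withSpace = resolveAll("http://example.com/images/ /", sourcesDE);
console.log(withSpace);
// [ 'http://example.com/images/%20/image.jpg', no such directory on the server
//   'http://example.com/images/image.jpg' ]    ../ leaves the %20 directory
```

In both tests ../image.jpg steps out of the extra segment and lands in /images/, which is why it is found. The difference is image.jpg: with // the server usually maps the path to the same file, while with / / it looks for a directory named with a space.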

You can use the following snippet to resolve your list of relative URLs to absolute ones:

var bases = [
  "http://example.com/images/",
  "http://example.com/images",
  "http://example.com/",
  "http://example.com/images//",
  "http://example.com/images/ /"
];
var urls = [
  "/images/image.jpg",
  "image.jpg",
  "./image.jpg",
  "images/image.jpg",
  "/image.jpg",
  "../image.jpg"
];
function newEl(type, contents) {
  var el = document.createElement(type);
  if(!contents) return el;
  if(!(contents instanceof Array))
    contents = [contents];
  for(var i=0; i<contents.length; ++i)
    if(typeof contents[i] == 'string')
      el.appendChild(document.createTextNode(contents[i]))
    else if(typeof contents[i] == 'object') // contents[i] instanceof Node
      el.appendChild(contents[i])
  return el;
}
function emoticon(str) {
  return {
    'http://example.com/images/image.jpg': 'good',
    'http://example.com/images//image.jpg': 'neutral'
  }[str] || 'bad';
}
var base = document.createElement('base'),
    a = document.createElement('a'),
    output = document.createElement('ul'),
    head = document.getElementsByTagName('head')[0];
head.insertBefore(base, head.firstChild);
for(var i=0; i<bases.length; ++i) {
  base.href = bases[i];
  var test = newEl('li', [
    'Test ' + (i+1) + ': ',
    newEl('span', bases[i])
  ]);
  test.className = 'test';
  var testItems = newEl('ul');
  testItems.className = 'test-items';
  for(var j=0; j<urls.length; ++j) {
    a.href = urls[j];
    var absURL = a.cloneNode(false).href;
      /* Stupid old IE requires cloning
         https://stackoverflow.com/a/24437713/1529630 */
    var testItem = newEl('li', [
      newEl('span', urls[j]),
      ' → ',
      newEl('span', absURL)
    ]);
    testItem.className = 'test-item ' + emoticon(absURL);
    testItems.appendChild(testItem);
  }
  test.appendChild(testItems);
  output.appendChild(test);
}
document.body.appendChild(output);
/* Styles for the test output generated by the script above */
span {
  background: #eef;
}
.test-items {
  display: table;
  border-spacing: .13em;
  padding-left: 1.1em;
  margin-bottom: .3em;
}
.test-item {
  display: table-row;
  position: relative;
  list-style: none;
}
.test-item > span {
  display: table-cell;
}
.test-item:before {
  display: inline-block;
  width: 1.1em;
  height: 1.1em;
  line-height: 1em;
  text-align: center;
  border-radius: 50%;
  margin-right: .4em;
  position: absolute;
  left: -1.1em;
  top: 0;
}
.good:before {
  content: ':)';
  background: #0f0;
}
.neutral:before {
  content: ':|';
  background: #ff0;
}
.bad:before {
  content: ':(';
  background: #f00;
}

You can also play with this snippet:

var resolveURL = (function() {
  var base = document.createElement('base'),
      a = document.createElement('a'),
      head = document.getElementsByTagName('head')[0];
  return function(url, baseurl) {
    if(base) {
      base.href = baseurl;
      head.insertBefore(base, head.firstChild);
    }
    a.href = url;
    var abs = a.cloneNode(false).href;
    /* Stupid old IE requires cloning
       https://stackoverflow.com/a/24437713/1529630 */
    if(base)
      head.removeChild(base);
    return abs;
  };
})();
var base = document.getElementById('base'),
    url = document.getElementById('url'),
    abs = document.getElementById('absolute');
base.onpropertychange = url.onpropertychange = function() {
  if (event.propertyName == "value")
    update()
};
(base.oninput = url.oninput = update)();
function update() {
  abs.value = resolveURL(url.value, base.value);
}
/* Styles for the resolver form */
label {
  display: block;
  margin: 1em 0;
}
input {
  width: 100%;
}
<!-- Markup for the resolver form -->
<label>
  Base url:
  <input id="base" value="http://example.com/images//foo////bar/baz"
         placeholder="Enter your base url here" />
</label>
<label>
  URL to be resolved:
  <input id="url" value="./a/b/../c"
         placeholder="Enter your URL here">
</label>
<label>
  Resulting url:
  <input id="absolute" readonly>
</label>
answered Nov 15 '22 10:11 by Oriol