I learned recently (from these questions) that at some point it was advisable to encode ampersands in href parameters. That is to say, instead of writing:
<a href="somepage.html?x=1&y=2">...</a>
One should write:
<a href="somepage.html?x=1&amp;y=2">...</a>
Apparently, the former example shouldn't work, but browser error recovery means it does.
We're now past the era of draconian XHTML requirements. Was this a requirement of XHTML's strict handling, or is it really still something that I should be aware of as a web developer?
For example, to encode an ampersand character in a URL, use %26. In HTML, however, use either &amp; or &#38;, both of which write out an ampersand in the HTML page.
In HTML, the ampersand character (“&”) declares the beginning of an entity reference (a special character). If you want one to appear in text on a web page, you should use the encoded named entity “&amp;”; the W3C documentation has the technical details.
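The two layers of encoding described above can be sketched in Python. This is only an illustration: the value "fish & chips" and the parameter names x and y are assumptions chosen for the example, and quote, urlencode and escape are standard-library helpers rather than anything the quoted sources prescribe.

```python
from html import escape
from urllib.parse import quote, urlencode

# Layer 1: an ampersand that is part of a *value* is percent-encoded for the URL.
value = "fish & chips"                      # hypothetical value
encoded_value = quote(value, safe="")
print(encoded_value)                        # fish%20%26%20chips

# Layer 2: an ampersand that *separates* parameters stays "&" in the URL,
# and is written as "&amp;" only when the URL is placed inside HTML.
url = "somepage.html?" + urlencode({"x": 1, "y": 2})
href = escape(url, quote=True)
print(url)                                  # somepage.html?x=1&y=2
print(href)                                 # somepage.html?x=1&amp;y=2
```

The browser undoes the HTML layer before it requests the page, so the server still sees x=1&y=2.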
A URL is composed from a limited set of characters belonging to the US-ASCII character set. These characters include digits (0-9), letters (A-Z, a-z), and a few special characters ("-", ".", "_", "~").
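As a rough sketch of what that character set means in practice, the hypothetical helper below (percent_encode is a name made up for this example) leaves the characters listed above alone and percent-encodes every other byte of the UTF-8 form. Real code would normally call urllib.parse.quote instead; this version just makes the rule visible.

```python
import string

# Digits, letters and the four special characters listed above.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")

def percent_encode(text: str) -> str:
    """Percent-encode every UTF-8 byte that is not in the set above."""
    out = []
    for byte in text.encode("utf-8"):
        ch = chr(byte)
        out.append(ch if ch in UNRESERVED else f"%{byte:02X}")
    return "".join(out)

print(percent_encode("a&b"))   # a%26b
print(percent_encode("$"))     # %24
```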
No difference. UTF-8 doesn't matter because & is reserved anyway. So use &amp;.
It is true that one of the differences between HTML5 and HTML4, quoted from the W3C Differences Page, is:
The ampersand (&) may be left unescaped in more cases compared to HTML4.
In fact, the HTML5 spec goes to great lengths describing actual algorithms that determine what it means to consume (and interpret) characters.
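In particular, in the section on tokenizing character references from Chapter 8 in the HTML5 spec, we see that when you are inside an attribute, and you see an ampersand character that is followed by:
a tab, LF, FF, space, <, &, EOF, or the additional allowed character (a " or ' if the attribute value is quoted, or a > if not) ===> then the ampersand is just an ampersand, no worries;
a number sign (#) ===> then the characters that follow are treated as a numeric character reference;
any other character ===> then the parser tries to match a named character reference, e.g., &notin;.
The last case is the one of interest to you since your example has: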
<a href="somepage.html?x=1&y=2">...</a>
You have the character sequence &y= (an ampersand, a lowercase y, and an equals sign).
Now here is the part from the HTML5 spec that is relevant in your case, because &y is not a named entity reference:
If no match can be made, then no characters are consumed, and nothing is returned. In this case, if the characters after the U+0026 AMPERSAND character (&) consist of a sequence of one or more alphanumeric ASCII characters followed by a U+003B SEMICOLON character (;), then this is a parse error.
You don't have a semicolon there, so you don't have a parse error.
Now suppose you had, instead,
<a href="somepage.html?x=1&eacute=2">...</a>
which is different because &eacute is a named entity reference in HTML. In this case, the following rule kicks in:
If the character reference is being consumed as part of an attribute, and the last character matched is not a ";" (U+003B) character, and the next character is either a "=" (U+003D) character or an alphanumeric ASCII character, then, for historical reasons, all the characters that were matched after the U+0026 AMPERSAND character (&) must be unconsumed, and nothing is returned. However, if this next character is in fact a "=" (U+003D) character, then this is a parse error, because some legacy user agents will misinterpret the markup in those cases.
So there the = makes it an error, because legacy browsers might get confused.
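The rules quoted above can be sketched as a small classifier. This is a simplified illustration, not the spec's algorithm: classify_ampersand and its return strings are made up for the example, Python's html.entities.html5 table stands in for the spec's named character reference table, the end of the string stands in for the additional allowed character, and numeric references are not validated.

```python
from html.entities import html5  # name -> character; includes legacy names without ";"

MAX_NAME = max(len(name) for name in html5)

def is_ascii_alnum(ch: str) -> bool:
    return ch.isascii() and ch.isalnum()

def classify_ampersand(rest: str) -> str:
    """Classify an '&' in an attribute value; rest is the text right after it."""
    if rest == "" or rest[0] in "\t\n\f <&":
        return "literal"
    if rest[0] == "#":
        return "numeric reference"
    # Longest prefix of rest that is a known reference name, if any.
    match = next(
        (rest[:end] for end in range(min(len(rest), MAX_NAME), 0, -1)
         if rest[:end] in html5),
        None,
    )
    if match is None:
        # No match: a parse error only if alphanumerics are followed by ";".
        i = 0
        while i < len(rest) and is_ascii_alnum(rest[i]):
            i += 1
        if i > 0 and rest[i:i + 1] == ";":
            return "literal, parse error"
        return "literal"
    if not match.endswith(";"):
        # Historical rule for attributes: unconsume the match.
        following = rest[len(match):len(match) + 1]
        if following == "=":
            return "literal, parse error"
        if following and is_ascii_alnum(following):
            return "literal"
    return "reference to " + html5[match]

print(classify_ampersand("y=2"))       # literal
print(classify_ampersand("eacute=2"))  # literal, parse error
print(classify_ampersand("amp;y=2"))   # reference to &
```

In both of the href examples the ampersand ends up being treated as a plain ampersand; the only difference is whether a parse error is reported.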
Despite the fact that the HTML5 spec seems to go to great lengths to say "well, this ampersand is not beginning a character entity reference, so there's no reference here," you might still run into URLs with parameter names that match named references (e.g., isin, part, sum, sub), which would result in parse errors. So IMHO you're better off escaping your ampersands as &amp;. But of course, you only asked whether restrictions were relaxed in attributes, not what you should do, and it does appear that they have been.
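To illustrate why escaping is the safer habit, here is a small sketch using Python's html.unescape as a stand-in for a decoder that does not apply the attribute-specific rule (it follows the rules for ordinary text). The parameter name copy is an assumption picked for the example because it is one of the legacy names recognized without a semicolon.

```python
from html import escape, unescape

raw = "somepage.html?x=1&copy=2"   # hypothetical URL with an unlucky parameter name

# Decoded as ordinary text, "&copy" is taken for a character reference.
print(unescape(raw))               # somepage.html?x=1©=2

# Escaping the ampersand first makes the round trip unambiguous.
escaped = escape(raw)
print(escaped)                     # somepage.html?x=1&amp;copy=2
print(unescape(escaped) == raw)    # True
```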
It would be interesting to see what validators can do.