
Do I really need to encode '&' as '&amp;'?

People also ask

When should data be encoded?

I always recommend to developers that they encode at the very last moment before they send the data to the external system.

Why HTML encoding is required?

HTML encoding ensures that text will be correctly displayed in the browser, not interpreted by the browser as HTML. For example, if a text string contains a less than sign (<) or greater than sign (>), the browser would interpret these characters as an opening or closing bracket of an HTML tag.

What is AMP in encoding?

In HTML, the ampersand character ("&") declares the beginning of an entity reference (a special character). If you want an ampersand to appear in text on a web page, you should use the encoded named entity "&amp;" (the W3C has the technical details).

Is ampersand a UTF 8 character?

It makes no difference. The document being UTF-8 doesn't matter, because & is reserved in HTML anyway. So use &amp;.


Yes. Just as the error said, attribute values in HTML are parsed, which means character entities are recognized inside them. Using & by itself is wrong; only lenient browsers, and the fact that this is HTML rather than XHTML, keep it from breaking the parse. Just escape it as &amp; and everything will be fine.

HTML5 allows you to leave it unescaped, but only when the data that follows does not look like a valid character reference. However, it's better just to escape all instances of this symbol than worry about which ones should be and which ones don't need to be.

Keep this point in mind: not escaping & as &amp; is bad enough for data that you create yourself (where the markup could very well be invalid). But if you're not escaping &, you might also not be escaping tag delimiters, and that is a huge problem for user-submitted data: it can lead to HTML and script injection, cookie stealing and other exploits.

Please just escape your code. It will save you a lot of trouble in the future.
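As a rough illustration of the point about user-submitted data, here is a minimal sketch using Python's standard `html.escape`. The `comment` string is a made-up example, not something from the question:

```python
from html import escape

# Hypothetical user-submitted comment containing markup and an ampersand.
comment = '<script>alert("x")</script> Tom & Jerry'

# escape() replaces &, < and >. With quote=True (the default) it also
# replaces " and ', which matters when the text lands in an attribute value.
safe = escape(comment)
print(safe)
# &lt;script&gt;alert(&quot;x&quot;)&lt;/script&gt; Tom &amp; Jerry
```

One call handles the ampersand and the tag delimiters together, so there is no separate decision to make for each character.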


Validation aside, the fact remains that encoding certain characters is important to an HTML document so that it can render properly and safely as a web page.

Encoding & as &amp; under all circumstances, for me, is an easier rule to live by, reducing the likelihood of errors and failures.

Compare the following: which is easier? Which is easier to bugger up?

Methodology 1

  1. Write some content which includes ampersand characters.
  2. Encode them all.

Methodology 2

(with a grain of salt, please ;) )

  1. Write some content which includes ampersand characters.
  2. On a case-by-case basis, look at each ampersand. Determine if:
  • It is isolated, and as such unambiguously an ampersand. E.g., volt & amp
     > In that case, don't bother encoding it.
  • It is not isolated, but you feel it is nonetheless unambiguous, as the resulting entity does not exist and will never exist since the entity list could never evolve. E.g., amp&volt
     > In that case, don't bother encoding it.
  • It is not isolated, and ambiguous. E.g., volt&amp
     > Encode it.

??
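To see what Methodology 2 is up against, the three cases above can be run through a decoder. This sketch assumes Python's `html.unescape`, which is documented as following the HTML5 rules for valid and invalid character references; a browser is not involved here, so treat it as an approximation of what a parser does with text content:

```python
from html import unescape

# The three examples from Methodology 2, left unencoded.
samples = ["volt & amp", "amp&volt", "volt&amp"]

# Decode each one the way an HTML5 text parser would read it.
decoded = {s: unescape(s) for s in samples}

for source, result in decoded.items():
    print(f"{source!r} -> {result!r}")
# 'volt & amp' -> 'volt & amp'
# 'amp&volt' -> 'amp&volt'
# 'volt&amp' -> 'volt&'
```

The first two survive, the third silently loses "amp". Telling them apart by eye is exactly the case-by-case work that Methodology 1 avoids.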


HTML5 rules are different from HTML4. It's not required in HTML5 - unless the ampersand looks like it starts a character reference. "&copy=2" is still a problem, for example, since &copy; is the copyright symbol.
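A quick sketch of the "&copy=2" case, again leaning on Python's `html.unescape` as a stand-in for an HTML5 parser reading text content. The `query` string is a hypothetical link target invented for the example:

```python
from html import escape, unescape

query = "page?id=7&copy=2"  # hypothetical link target

# Left raw, "&copy" is read as the legacy reference for the copyright sign.
as_parsed_raw = unescape(query)
print(as_parsed_raw)   # page?id=7©=2

# Escaped first, the same string comes back unchanged after decoding.
escaped = escape(query)
round_trip = unescape(escaped)
print(escaped)         # page?id=7&amp;copy=2
print(round_trip)      # page?id=7&copy=2
```

As far as I know, HTML5 parses attribute values a little more leniently than text content in this situation, which is one more special case to remember if you decide not to encode everywhere.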

However it seems to me that it's harder work to decide to encode or not to encode depending on the following text. So the easiest path is probably to encode all the time.


I think this has turned into more of a question of "why follow the spec when browsers don't care." Here is my generalized answer:

Standards are not a "present" thing. They are a "future" thing. If we, as developers, follow web standards, then browser vendors are more likely to correctly implement those standards, and we move closer to a completely interoperable web, where CSS hacks, feature detection, and browser detection are not necessary. Where we don't have to figure out why our layouts break in a particular browser, or how to work around that.

Specifically, if HTML5 does not require using &amp; in your specific situation, and you're using an HTML5 doctype (and also expecting your users to be using HTML5-compliant browsers), then there is no reason to do it.


Well, if it comes from user input then absolutely yes, for obvious reasons. Think if this very website didn't do it: the title of this question would show up as Do I really need to encode ‘&’ as ‘&’?

If it's just something like echo '<title>Dolce & Gabbana</title>'; then strictly speaking you don't have to. It would be better, but if you don't, no user will notice the difference.
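For comparison, here is roughly what escaping at output time looks like. The answer's snippet is PHP; this sketch uses Python's `html.escape` instead, and `render_title` is a hypothetical helper, not part of any framework:

```python
from html import escape

def render_title(text: str) -> str:
    """Hypothetical helper: wrap escaped text in a <title> element."""
    # quote=False leaves ' and " alone, which is fine for element content.
    return f"<title>{escape(text, quote=False)}</title>"

print(render_title("Dolce & Gabbana"))
# <title>Dolce &amp; Gabbana</title>

print(render_title("Do I really need to encode '&' as '&amp;'?"))
# <title>Do I really need to encode '&amp;' as '&amp;amp;'?</title>
```

The second call is the user-input case: the literal text "&amp;" has to go out as "&amp;amp;" so that the browser displays "&amp;" rather than a bare "&".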


Could you show us what your title actually is? When I submit

<!DOCTYPE html>
<html>
<title>Dolce & Gabbana</title>
<body>
<p>Am I allowed loose & mpersands?</p>
</body>
</html>

to http://validator.w3.org/ - explicitly asking it to use the experimental HTML 5 mode - it has no complaints about the &s...