Why do HTML entity names with dec < 255 not require semicolon?

In a plain HTML document, &pound (dec 163) renders as £ without needing the ;, whereas &oelig (dec 339) only renders as œ with the semicolon. It seems that every HTML entity with a decimal value under 255 renders without the semicolon, in both Firefox and Chrome.

What gives?

bryc asked Sep 08 '13 22:09


2 Answers

The reason is that historically the semicolon has been optional when an entity reference (or a character reference) is not immediately followed by a name character. So &pound? is OK since ? is not a name character (i.e., a character allowed in names), but &pound4 is not, since 4 is a name character, making pound4 the entity name (which is undefined in HTML, but might become defined some day). This rule is part of the SGML legacy in HTML, one of the few areas where browsers actually implemented the peculiarities of SGML.

It has, however, always been regarded as good practice to terminate entity references with a semicolon. XML, and hence XHTML, makes it formally mandatory.

This is why current browser practices allow omission of the semicolon as in “classic” HTML, but only for the limited set of character references denoting ISO Latin 1 characters, i.e. characters with a Unicode number less than 256 in decimal (at most FF in hexadecimal). This was the original set of entity references, and therefore such references have been widely used without a semicolon. So the practices are a compromise: browsers want to encourage the recommended notation without invalidating the bulk of old pages, let alone failing to render them properly.
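As a rough way to see this compromise without opening a browser, here is a minimal sketch using Python's standard `html` module, whose `html.unescape` is documented as following the HTML5 rules for both valid and invalid character references. The names `legacy`, `samples`, `results` and `highest` are only for illustration. In the `html5` table, a name listed without a trailing `;` is one that may be written without the semicolon.

```python
import html
from html.entities import html5

# Names listed WITHOUT a trailing ";" are the legacy references that
# HTML5 parsers still recognize when the semicolon is missing.
legacy = {name: char for name, char in html5.items() if not name.endswith(";")}

samples = ["&pound", "&pound?", "&oelig", "&oelig;"]
results = {s: html.unescape(s) for s in samples}
for source, decoded in results.items():
    print(f"{source!r:12} -> {decoded!r}")

# Highest code point reachable through a semicolon-less name.
highest = max(ord(char) for char in legacy.values())
print("highest legacy code point:", highest)
```

Here `&pound` and `&pound?` decode to £ (with the `?` kept), `&oelig` is left as literal text, and `&oelig;` decodes to œ. The highest legacy code point stays below 256. The converse does not hold, though: some names for characters below 256 were added later and do need the semicolon.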

The HTML5 drafts have taken various positions on this, but e.g. the HTML5 CR from 6 August 2013 requires the semicolon in all cases, even in HTML syntax. A missing semicolon is defined as a parse error, which means that error handling is well-defined (the entity shall be recognized), but browsers may still stop parsing at the first parse error!

Jukka K. Korpela answered Oct 09 '22 07:10


Firstly, this is entirely up to how forgiving the browser/rendering engine wants to be, and is not a property of HTML: all entities must end in a semicolon, or you have invalid syntax. (The WHATWG "HTML Living Standard" confusingly considers this semicolon to be part of the name, making it seem optional in the Developer Edition, but the full Standard text/W3C HTML5 draft is clearer: "The name must be one that is terminated by a U+003B SEMICOLON character (;).")

Secondly, referring to a character as having a "decimal value" is ambiguous at best. 163 and 339 are the "code points" of those characters in Unicode, which would normally be expressed in hexadecimal. Other encodings would have different positions for those characters, which could also be expressed as a "decimal value" if you wanted.
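To make the distinction concrete, here is a small Python sketch (the variable names are just for illustration). It shows the Unicode code points of the two characters in decimal and hexadecimal, and the different byte values they get in a few encodings, chosen here only as examples.

```python
pound, oelig = "\u00a3", "\u0153"  # the characters £ and œ

# Unicode code points: the same number, written in decimal and in U+hex form.
code_points = {ch: (ord(ch), f"U+{ord(ch):04X}") for ch in (pound, oelig)}
print(code_points)

# The byte value ("position") depends on the encoding that is chosen.
encoded = {
    "pound in latin-1": pound.encode("latin-1"),
    "pound in utf-8": pound.encode("utf-8"),
    "oelig in cp1252": oelig.encode("cp1252"),
    "oelig in utf-8": oelig.encode("utf-8"),
}
for label, data in encoded.items():
    print(f"{label:18} {data.hex(' ')}")
```

So œ is 339 (U+0153) in Unicode but sits at byte 9c (156 in decimal) in Windows-1252, and it has no position at all in ISO Latin 1.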

Thirdly, my guess is that it is not so much to do with where they come in a particular encoding sequence, but how common they are: the full list is extremely long (see the WHATWG and W3C lists of named character references). There is a trade-off to be made in interpreting such invalid sequences, since a URL might contain unescaped ampersands, which then in turn look like unterminated entities (e.g. http://example.com/foo?bar=rab&oelig=gileo). So browsers are trying to tread that fine line and guess which mistake was probably made in a particular case.
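A minimal sketch of that hazard, again using Python's `html.unescape`. Note the assumption: it applies the HTML5 text-content rules only, whereas real HTML parsers treat attribute values such as `href` more leniently when a semicolon-less reference is followed by `=` or an alphanumeric character. The `&pound=5` URL is a hypothetical variant of the example above.

```python
import html

url_oelig = "http://example.com/foo?bar=rab&oelig=gileo"
url_pound = "http://example.com/foo?bar=rab&pound=5"  # hypothetical variant
url_escaped = "http://example.com/foo?bar=rab&amp;oelig=gileo"

# "oelig" is not a legacy name, so the unescaped ampersand survives.
decoded_oelig = html.unescape(url_oelig)

# "pound" is a legacy name, so under text-content rules it is decoded
# and the query string is silently corrupted.
decoded_pound = html.unescape(url_pound)

# Writing the ampersand as "&amp;" avoids the guesswork entirely.
decoded_escaped = html.unescape(url_escaped)

for value in (decoded_oelig, decoded_pound, decoded_escaped):
    print(value)
```

The first URL comes back unchanged, the second ends in `rab£=5`, and the properly escaped third one decodes to the intended `...rab&oelig=gileo`.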

IMSoP answered Oct 09 '22 05:10