Spec justification for &#x80; to &#x9F; in UTF-8 documents browser behaviour wanted

Question

The HTML 4.01 spec says, for hexadecimal character references:

Numeric character references specify the code position of a character in the document character set.

So if the document character set encoding is UTF-8, the numeric references should specify a Unicode code point.

The HTML5 spec says, for hexadecimal character references:

The ampersand must be followed by a U+0023 NUMBER SIGN character (#), which must be followed by either a U+0078 LATIN SMALL LETTER X character (x) or a U+0058 LATIN CAPITAL LETTER X character (X), which must then be followed by one or more digits in the range U+0030 DIGIT ZERO (0) to U+0039 DIGIT NINE (9), U+0061 LATIN SMALL LETTER A to U+0066 LATIN SMALL LETTER F, and U+0041 LATIN CAPITAL LETTER A to U+0046 LATIN CAPITAL LETTER F, representing a base-sixteen integer that corresponds to a Unicode code point that is allowed according to the definition below. The digits must then be followed by a U+003B SEMICOLON character (;).

No mention is made of the document character set, and it simply says that the numeric value identifies a Unicode code point.

But it seems that all the modern browsers (I haven't tested older ones) treat &#x80; through &#x9F; as if they were referencing Windows-1252.

For example, &#x80; displays €, but U+0080 isn't the code point for €; U+20AC is. U+0080 itself is a C1 control character (PAD).

&#x20AC; also (correctly) displays €.
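
The mismatch can be checked outside a browser. A minimal sketch in Python, using only the standard library, compares the literal Unicode reading of 0x80 with the Windows-1252 reading of the same value (the variable names are just for illustration):

```python
import unicodedata

value = 0x80

# Literal reading: treat the number as a Unicode code point.
literal = chr(value)

# Windows-1252 reading: treat the number as a single byte in that encoding.
as_cp1252 = bytes([value]).decode("cp1252")

print(unicodedata.category(literal))  # Cc (a control character)
print(f"U+{ord(as_cp1252):04X}")      # U+20AC
print(unicodedata.name(as_cp1252))    # EURO SIGN
```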

Is this simply pragmatic behaviour by browsers or is there a justification in a specification that I'm missing?

[Note that decimal character references have the same behaviour. I've just used the hexadecimal ones for clarity and consistency.]

Alohci · Accepted Answer

I found the answer to my question. It's in the tokenization section of the HTML5 parsing algorithm, under "consume a character reference", which defines a table that maps these numeric values to other code points.
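
As an illustration rather than a restatement of the spec text: Python's standard `html.unescape` is documented as following the HTML5 rules for character references, so it should reproduce what the browsers do. The helper `show` is a hypothetical name introduced only for this sketch.

```python
import html


def show(reference: str) -> str:
    """Resolve one character reference and report the resulting code point."""
    resolved = html.unescape(reference)
    return f"U+{ord(resolved):04X}"


print(show("&#x80;"))    # U+20AC
print(show("&#128;"))    # U+20AC (decimal form, same result)
print(show("&#x9F;"))    # U+0178
print(show("&#x20AC;"))  # U+20AC
print(show("&#x81;"))    # U+0081
```

The last line shows a value with no Windows-1252 assignment: it is left as the C1 control code point instead of being remapped.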
