Spec justification for &#x80; to &#x9F; in UTF-8 documents browser behaviour wanted

The HTML 4.01 spec says, for hexadecimal character references:

Numeric character references specify the code position of a character in the document character set.

So if the document character set encoding is UTF-8, the numeric references should specify a Unicode code point.

The HTML5 spec says, for hexadecimal character references:

The ampersand must be followed by a U+0023 NUMBER SIGN character (#), which must be followed by either a U+0078 LATIN SMALL LETTER X character (x) or a U+0058 LATIN CAPITAL LETTER X character (X), which must then be followed by one or more digits in the range U+0030 DIGIT ZERO (0) to U+0039 DIGIT NINE (9), U+0061 LATIN SMALL LETTER A to U+0066 LATIN SMALL LETTER F, and U+0041 LATIN CAPITAL LETTER A to U+0046 LATIN CAPITAL LETTER F, representing a base-sixteen integer that corresponds to a Unicode code point that is allowed according to the definition below. The digits must then be followed by a U+003B SEMICOLON character (;).

No mention is made of the document character set, and it simply says that the numeric value identifies a Unicode code point.

But it seems that all the modern browsers (I haven't tested older ones) treat &#x80; through &#x9F; as if they were referencing Windows-1252.

For example, &#x80; displays €, but U+0080 isn't the code point for €; U+20AC is. Unicode defines U+0080 as a control character (PAD).

&#x20AC; also (correctly) displays €.
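To make the mismatch concrete, here is a small sketch in Python (chosen only for illustration; the variable names are mine). It checks what Unicode says about U+0080 and U+20AC, and what byte 0x80 decodes to in Windows-1252, using only the standard library.

```python
import unicodedata

# U+0080 is a C1 control character, not a printable symbol.
c1_control = chr(0x80)
c1_category = unicodedata.category(c1_control)  # "Cc" means "control"

# The euro sign is assigned to U+20AC.
euro = chr(0x20AC)
euro_name = unicodedata.name(euro)

# Byte 0x80 interpreted as Windows-1252 is the euro sign, which matches
# what browsers display for the reference &#x80;.
cp1252_at_80 = bytes([0x80]).decode("cp1252")

print(f"U+0080 general category: {c1_category}")
print(f"U+20AC name: {euro_name}")
print(f"Windows-1252 byte 0x80 -> U+{ord(cp1252_at_80):04X}")
```

So the number 0x80 only leads to the euro sign if it is read as a Windows-1252 byte rather than as a Unicode code point.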

Is this simply pragmatic behaviour by browsers or is there a justification in a specification that I'm missing?

[Note that decimal character references have the same behaviour. I've just used the hexadecimal ones for clarity and consistency.]

Alohci asked Feb 23 '23 01:02



1 Answer

I found the answer to my question. It's in the tokenization section of the HTML5 parsing algorithm, under "consume a character reference". That step defines a table that maps these numbers to the corresponding Windows-1252 characters, and it treats such references as parse errors.
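As a rough way to see that mapping without a browser, the sketch below uses Python's `html.unescape`, which is documented as following the HTML 5 rules for both valid and invalid character references. The helper names `resolve_reference` and `windows_1252_char` are hypothetical, introduced only for this example, and Python's `cp1252` codec is used as a stand-in for Windows-1252 (it leaves a few bytes, such as 0x81, undefined).

```python
import html


def resolve_reference(number: int) -> str:
    """Resolve a hexadecimal character reference using html.unescape."""
    return html.unescape(f"&#x{number:X};")


def windows_1252_char(byte_value: int):
    """Return the Windows-1252 character for a byte, or None if Python's
    cp1252 codec has no character assigned to it."""
    try:
        return bytes([byte_value]).decode("cp1252")
    except UnicodeDecodeError:
        return None


# Walk the range from the question and show what each reference resolves to.
for number in range(0x80, 0xA0):
    resolved = resolve_reference(number)
    expected = windows_1252_char(number)
    note = "" if expected is not None else " (no Windows-1252 character assigned)"
    print(f"&#x{number:X}; -> U+{ord(resolved):04X}{note}")
```

For every number where Python's codec defines a Windows-1252 character, the resolved reference is that same character, which is consistent with the behaviour described in the question. A real browser may still differ in details, so treat this as an approximation of the spec's table rather than a test of any particular browser.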

Alohci answered Apr 27 '23 02:04
