
Showing a latin 1 character in a UTF-8 page

Here is a test.html file, saved with my text editor in Latin-1 (ISO-8859-1) encoding:

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
<body>
è
</body>
</html>

If I view the file in Chrome, the è character is shown as a question mark. I don't understand why: è is part of Latin-1, and Latin-1 is supposed to be compatible with (a subset of) UTF-8, so shouldn't the code for the character è be the same in Latin-1 and UTF-8?

If I change the charset to ISO-8859-1 of course everything is fine.

Thanks

Eugenio asked Oct 24 '25 20:10


1 Answer

You are confusing the notion of character sets / code pages with encodings. UTF-8 and ISO-8859-1 (Latin-1) are encodings: each is a system for representing characters as bytes, not a list of characters that you choose from.

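You saved the file as ISO-8859-1, so the è is stored as the single byte 0xE8. You tell the browser that the file is encoded in UTF-8, so the browser tries to decode it according to the rules of UTF-8. A lone 0xE8 is invalid in UTF-8: it is the lead byte of a three-byte sequence and must be followed by two continuation bytes (0x80 to 0xBF), but here it is followed by a line break. In UTF-8, è is encoded as the two bytes 0xC3 0xA8.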
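
The mismatch can be reproduced outside the browser. The following is a minimal Python sketch, not what Chrome does internally: the variable names are made up for illustration, and `errors="replace"` is only assumed to approximate a browser's lenient handling of invalid bytes.

```python
# "\u00e8" is the character è, written as an escape so this snippet
# does not depend on how the snippet itself is saved.
E_GRAVE = "\u00e8"

# The single byte a Latin-1 editor writes for è.
latin1_bytes = E_GRAVE.encode("iso-8859-1")   # b'\xe8'

# The two bytes a UTF-8 editor would have written instead.
utf8_bytes = E_GRAVE.encode("utf-8")          # b'\xc3\xa8'

# Strict UTF-8 decoding of the Latin-1 byte (followed by the line break
# from the file) fails, because 0xE8 expects two continuation bytes.
try:
    (latin1_bytes + b"\n").decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# A lenient decoder substitutes U+FFFD (the replacement character)
# for the invalid byte instead of raising.
shown = (latin1_bytes + b"\n").decode("utf-8", errors="replace")

print(latin1_bytes, utf8_bytes, strict_ok, repr(shown))
```

The replacement character U+FFFD is typically drawn as a question mark, which would match what the question describes.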

When you tell the browser to decode it as ISO-8859-1, it works because 0xE8 is valid in ISO-8859-1, and the browser shows the character that ISO-8859-1 assigns to the value 0xE8, which is è.

Also, ISO-8859-1 is a subset of Unicode (the "code page" shared by the UTF encodings), not of UTF-8. What that means is that the 256 characters of ISO-8859-1 are the same characters, with the same numbers, as the first 256 code points of Unicode. Only the first 128 of them (the ASCII range) are also stored as the same bytes in UTF-8.
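
As a rough check of the difference between "same code point" and "same bytes", here is a small Python sketch. The variable names are illustrative only.

```python
ch = "\u00e8"  # the character è

# The Unicode code point of è has the same number as its Latin-1 byte.
code_point = ord(ch)                 # 0xE8
latin1 = ch.encode("iso-8859-1")     # b'\xe8'

# UTF-8 stores that same code point as two different bytes.
utf8 = ch.encode("utf-8")            # b'\xc3\xa8'

# Every Latin-1 byte value decodes to the code point with the same number.
all_match = all(
    bytes([b]).decode("iso-8859-1") == chr(b) for b in range(256)
)

# Byte values whose UTF-8 encoding is that same single byte: ASCII only.
same_bytes = [b for b in range(256) if chr(b).encode("utf-8") == bytes([b])]

print(hex(code_point), latin1, utf8, all_match, len(same_bytes))
```

So Latin-1 and Unicode agree on numbering for all 256 characters, while Latin-1 and UTF-8 agree on bytes only for the 128 ASCII characters.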

And there's more. Browsers never actually use ISO-8859-1 to decode your page; they silently use Windows-1252 instead. This is also specified in the HTML5 draft.
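
To see where the two encodings differ, here is a short Python sketch using Python's own `iso-8859-1` and `cp1252` codecs. It illustrates the encodings themselves, not browser behaviour, and the sample bytes are chosen arbitrarily for the example.

```python
# Three sample bytes: two from the 0x80-0x9F range and the è byte.
raw = b"\x80\x93\xe8"

# ISO-8859-1 maps 0x80-0x9F to invisible C1 control characters.
as_latin1 = raw.decode("iso-8859-1")   # '\x80\x93' followed by è

# Windows-1252 puts printable characters there: the euro sign and
# a left double quotation mark. The è byte decodes the same way.
as_cp1252 = raw.decode("cp1252")

print(repr(as_latin1), repr(as_cp1252))
```

The è byte 0xE8 decodes identically under both, which is why the substitution goes unnoticed for a page like the one in the question.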

Esailija answered Oct 27 '25 12:10