
Are named entities in HTML still necessary in the age of Unicode aware browsers?

I have done a lot of PHP programming over the last few years, and one thing that keeps annoying me is the weak support for Unicode and multibyte strings (to be sure, natively there is none). For example, "htmlentities" seems to be a much-used function in the PHP world, and I find it absolutely annoying when you've put effort into keeping every string localizable, storing only UTF-8 in your database, delivering only UTF-8 webpages, etc. Suddenly, somewhere between your database and the browser, there's this hopelessly naive function that pretends every byte is a character and messes everything up.
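To make the failure concrete, here is a minimal sketch in Python (used purely for illustration, not PHP). The helper `naive_entities` is hypothetical: it is not PHP's implementation, just a stand-in for any converter that assumes one byte equals one Latin-1 character.

```python
from html.entities import codepoint2name


def naive_entities(raw: bytes) -> str:
    """Hypothetical byte-naive converter: treats each byte as one Latin-1 character."""
    out = []
    for byte in raw:
        # Look up a named entity for the byte value, as if it were a code point.
        name = codepoint2name.get(byte)
        out.append(f"&{name};" if name else chr(byte))
    return "".join(out)


utf8_bytes = "ä".encode("utf-8")      # two bytes: 0xC3 0xA4
mangled = naive_entities(utf8_bytes)  # each byte becomes its own entity
print(mangled)                        # &Atilde;&curren;
```

The single character "ä" comes out as two unrelated entities, which is the kind of damage described above.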

I would love to just dump this kind of function; they seem totally superfluous. Is it still necessary these days to write '&auml;' instead of 'ä'? At least my Firefox seems perfectly happy to display even the strangest Asian glyphs as long as they're served in a proper encoding.

Update: To be more precise: Are named entities necessary for anything other than escaping HTML markup characters (as in "&lt;" for "<")?

Update 2:

@Konrad: Are you saying that, no, named entities are not needed?

@Ross: But wouldn't it be better to sanitize user input when it's entered, to keep my output logic free from such issues? (assuming of course, that reliable sanitizing on input is possible - but then, if it isn't, can it be on output?)

Hanno Fietz asked Aug 24 '08 16:08



1 Answer

Named entities in "real" XHTML (i.e. served as application/xhtml+xml, rather than in the more frequently used text/html compatibility mode) are discouraged. Aside from the five predefined in XML itself (&lt;, &gt;, &amp;, &quot;, &apos;), they all have to be declared in the DTD of the particular DocType you're using. That means your browser has to explicitly support that DocType, which is far from a given. Numbered entities (numeric character references such as &#228;), on the other hand, need no DTD at all: the number is simply the Unicode code point of the character.
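The difference is easy to see with a plain XML parser that has no DTD to consult. The following is a small sketch using Python's standard `xml.etree.ElementTree` (the `parses` helper is made up for this example); a browser's XHTML parser is a different piece of software, so treat this as an illustration of the XML rules rather than of any particular browser.

```python
import xml.etree.ElementTree as ET


def parses(markup: str) -> bool:
    """Return True if the markup is accepted by a DTD-less XML parser."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


numeric_ok = parses("<p>&#228;</p>")          # numeric reference: accepted
predefined_ok = parses("<p>&lt;&amp;&gt;</p>")  # XML's predefined entities: accepted
named_ok = parses("<p>&auml;</p>")            # named entity, no DTD: rejected

text = ET.fromstring("<p>&#228;</p>").text    # resolves to the character itself
print(numeric_ok, predefined_ok, named_ok, text)  # True True False ä
```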

As for whether you need entities at all these days: you can pretty much expect any modern browser to support UTF-8. Therefore, as long as you can guarantee that the database, the markup and the web server all agree to serve that, ditch the entities.
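In practice that leaves only the markup-significant characters to escape on output. Below is a rough Python sketch of the idea, using the standard `html.escape`; the `render_comment` function and the page skeleton are assumptions made up for this example. In PHP the closest equivalent would be `htmlspecialchars` with an explicit UTF-8 charset argument rather than `htmlentities`.

```python
import html


def render_comment(user_text: str) -> bytes:
    """Hypothetical output step: escape markup characters only, then encode as UTF-8."""
    # html.escape touches only &, <, > and (with quote=True) the two quote characters.
    escaped = html.escape(user_text, quote=True)
    page = f'<!DOCTYPE html><meta charset="utf-8"><p>{escaped}</p>'
    # The bytes sent to the client must match the declared charset.
    return page.encode("utf-8")


body = render_comment('Käse & <b>日本語</b>')
print(body.decode("utf-8"))
# <!DOCTYPE html><meta charset="utf-8"><p>Käse &amp; &lt;b&gt;日本語&lt;/b&gt;</p>
```

The non-ASCII characters pass through untouched; this only works if the HTTP Content-Type header (or the meta tag, as here) also announces UTF-8.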

Sören Kuklau answered Sep 22 '22 19:09
