HTML character entities and character encoding set

When including HTML entities in an HTML document, do the entities need to be from the same character encoding set that the document is specified to be using?

For example, if I am going to use the copyright sign in an HTML document that is specified as UTF-8, is it necessary to use the Unicode HTML entity (&#xA9;) or is it okay to use other entities, such as the ASCII HTML entity (&#169;)?

Please explain your answer. I am aware that it will "work", but is there a case where it will not work?

Thanks!

Mike Moore asked Aug 29 '10 00:08


2 Answers

&#169; and &#xA9; specify the same character: 169 in decimal is A9 in hexadecimal, so both produce the copyright symbol. Numeric character references in HTML always refer to Unicode code points, whatever encoding the document itself uses; this is covered in the HTML 4 standard. Thus, even if your character set changes, your entities still refer to the same characters.
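As a quick check of that equivalence, here is a minimal Python sketch. It assumes that the standard library's html.unescape resolves references the way an HTML parser would for this character; the variable names are only illustrative.

```python
import html

decimal_ref = "&#169;"   # decimal numeric character reference
hex_ref = "&#xA9;"       # hexadecimal numeric character reference

# 0xA9 and 169 are the same integer, so both references name U+00A9.
assert int("A9", 16) == 169

decoded_decimal = html.unescape(decimal_ref)
decoded_hex = html.unescape(hex_ref)

print(decoded_decimal == decoded_hex)     # True
print(f"U+{ord(decoded_decimal):04X}")    # U+00A9
```

Neither call is told anything about a document encoding, which is the point: the reference is resolved against Unicode alone.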

This also means that you can include characters that your chosen character set cannot represent directly. I just created a document in the ISO-8859-1 character set, but it includes a Greek lambda, written as a character reference. Likewise, ASCII is not able to directly encode a copyright symbol, but it can express one through character entities.
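One way to see this outside a browser is Python's "xmlcharrefreplace" error handler, which swaps in a decimal character reference for anything the target encoding cannot hold. This is only a sketch of the idea, not how any particular editor saves files, and the sample string is made up for illustration.

```python
text = "\u03bb and \u00a9"  # Greek small lambda and the copyright sign

# ISO-8859-1 has no lambda, so a strict encode fails.
try:
    text.encode("iso-8859-1")
    lambda_fits_latin1 = True
except UnicodeEncodeError:
    lambda_fits_latin1 = False

# xmlcharrefreplace substitutes a decimal numeric character reference
# for every character the target encoding cannot represent.
latin1_bytes = text.encode("iso-8859-1", errors="xmlcharrefreplace")
ascii_bytes = text.encode("ascii", errors="xmlcharrefreplace")

print(lambda_fits_latin1)  # False
print(latin1_bytes)        # b'&#955; and \xa9'
print(ascii_bytes)         # b'&#955; and &#169;'
```

ISO-8859-1 can store the copyright sign as a single byte but needs a reference for the lambda; ASCII needs a reference for both.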

Edit: Reading the comments on the other answer, I want to clarify this a bit. If you are using UTF-8 as the character encoding for your document, you can, within the raw HTML source, write a copyright symbol just as-is. (You need to find some way to input it, of course: copy-paste being the usual.) UTF-8 will allow you to directly encode any symbol you want. ISO-8859-1 is much more limited, and ASCII even more so. For example, within my HTML, if my document is a UTF-8 document, I can do:

<p>Hi there. This document is ©2010. Good day!</p>

or:

<p>Hi there. This document is &#xA9;2010. Good day!</p>

or:

<p>Hi there. This document is &copy;2010. Good day!</p>

The first is only valid if the document's character encoding can represent "©". The other two are always valid, but less readable. Whatever text editor you're using, if it is worth its salt, should be able to tell you what character set it is encoding the document in.
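To confirm that the three paragraphs carry the same text once parsed, the sketch below runs them through Python's html.parser. TextCollector and visible_text are hypothetical helper names introduced here, and the standard-library parser is only a stand-in for a browser.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects text content, with character references already resolved."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def visible_text(markup):
    parser = TextCollector()
    parser.feed(markup)
    parser.close()  # flush any text still buffered
    return "".join(parser.chunks)


literal = "<p>Hi there. This document is \u00a92010. Good day!</p>"
numeric = "<p>Hi there. This document is &#xA9;2010. Good day!</p>"
named = "<p>Hi there. This document is &copy;2010. Good day!</p>"

texts = [visible_text(m) for m in (literal, numeric, named)]
print(texts[0] == texts[1] == texts[2])  # True
```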

If you write the character directly, you need to make sure your web server informs the client of the correct character set, or that your document declares it with something like:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

I've used UTF-8 there as an example. XHTML served as XML should declare its encoding in the opening <?xml ... ?> declaration instead.
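The reason the declaration matters can be sketched in Python: the same UTF-8 bytes read under the wrong charset turn into mojibake, while a page that uses only ASCII plus an entity survives either reading. The sample pages below are invented for illustration.

```python
import html

page = "<p>This document is \u00a92010.</p>"
utf8_bytes = page.encode("utf-8")

# Decoded with the charset it was saved in, the page round-trips.
right = utf8_bytes.decode("utf-8")

# Decoded as ISO-8859-1, the two UTF-8 bytes of the copyright sign
# (C2 A9) are read as two separate characters.
wrong = utf8_bytes.decode("iso-8859-1")

# A page that sticks to ASCII plus an entity reads the same under both.
entity_page = "<p>This document is &copy;2010.</p>"
entity_bytes = entity_page.encode("ascii")
same_either_way = (
    entity_bytes.decode("utf-8") == entity_bytes.decode("iso-8859-1")
)

print(right == page)     # True
print(wrong == page)     # False
print(same_either_way)   # True
```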

Thanatos answered Oct 15 '22 23:10


The beauty of the UTF-8 encoding is that you can just include the character itself. You don't need to encode it as an entity at all. Thusly: ©

Oh, you just want to know the difference between the two entities? There is none. One gives the character's code point in hexadecimal and the other in decimal.
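For what it's worth, the number in a reference is the Unicode code point, which is not necessarily the byte stored on disk. A small Python sketch, assuming nothing beyond the built-in codecs:

```python
copyright_sign = "\u00a9"

code_point = ord(copyright_sign)
utf8 = copyright_sign.encode("utf-8")
latin1 = copyright_sign.encode("iso-8859-1")

print(code_point, hex(code_point))  # 169 0xa9
print(utf8)                         # b'\xc2\xa9'
print(latin1)                       # b'\xa9'
```

In ISO-8859-1 the byte happens to match the code point; in UTF-8 the same character takes two bytes.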

RibaldEddie answered Oct 15 '22 22:10