XHTML5 and HTML4 character entities

Does XHTML5 support character entities such as &nbsp; and &mdash;? At work we can require specific software to access the admin side of the site, and people are demanding multi-file upload. For me this is an easy justification to require migrating to FF 3.6+, so I'll be doing it soonish. We currently use XHTML 1.1, and upon moving to HTML5, the only issues I'm having are with character entity names. Does anyone have a doc on this?

I see there is a list in the WHATWG spec, but I'm not sure whether it applies to files served as application/xhtml+xml. In any case, the two entities mentioned trigger errors in both a Chromium nightly and FF 3.6.

asked Jul 09 '10 17:07

NO WAR WITH RUSSIA

3 Answers

There is no DTD for XHTML5, so an XML parser will see no entity definitions (other than the predefined ones). If you wanted to use an entity you would have to define it for yourself in the internal subset.

<!DOCTYPE html [
    <!ENTITY mdash "—">
]>
<html xmlns="http://www.w3.org/1999/xhtml">
    ... &mdash; ...
</html>

(Of course using the internal subset is likely to trip browsers up if you serve it to them as text/html. Sending an internal subset in a non-XHTML HTML5 document is disallowed.)
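
To see the difference outside a browser, here is a minimal sketch using Python's standard-library ElementTree parser (an illustration only; browsers and other XML parsers may report the error differently). The helper name body_text and the sample documents are made up for this example. The first document has no entity definitions and is rejected; the second declares mdash in the internal subset and parses.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# No DTD, so the parser only knows the five predefined XML entities.
without_subset = f'<html xmlns="{XHTML_NS}"><body>a &mdash; b</body></html>'

# Same document, but with mdash declared in the internal subset.
with_subset = (
    "<!DOCTYPE html [\n"
    '    <!ENTITY mdash "&#x2014;">\n'
    "]>\n"
    f'<html xmlns="{XHTML_NS}"><body>a &mdash; b</body></html>'
)


def body_text(document: str) -> str:
    """Parse the document as XML and return the text of its <body> (hypothetical helper)."""
    root = ET.fromstring(document)
    return root.find(f"{{{XHTML_NS}}}body").text


try:
    body_text(without_subset)
    undefined_entity_rejected = False
except ET.ParseError:
    # The parser has no definition for &mdash; and refuses the document.
    undefined_entity_rejected = True

resolved = body_text(with_subset)
print(undefined_entity_rejected)
print(resolved == "a \u2014 b")
```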

The HTML5 wiki currently recommends:

Do not use entity references in XHTML (except for the 5 predefined entities: &amp;, &lt;, &gt;, &quot; and &apos;)

And I agree with this advice not just for XHTML5 but for XML and HTML in general. There's little reason to be using the HTML entities for anything today. Unicode characters typed directly are far more readable for everyone, and &#...; character references are available for those sad cases when you can't guarantee an 8-bit/encoding-clean transport. (Since HTML entities are not defined for the majority of Unicode characters, you are going to need those anyway.)
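
As a sketch of that fallback, the Python snippet below turns arbitrary text into pure-ASCII markup using only numeric character references. The function name to_ascii_markup and the sample string are invented for illustration; the only library calls are the standard xml.sax.saxutils.escape, the built-in "xmlcharrefreplace" error handler, and html.unescape for the round-trip check.

```python
import html
from xml.sax.saxutils import escape


def to_ascii_markup(text: str) -> str:
    """Escape markup characters, then replace every non-ASCII character
    with a decimal &#...; character reference (hypothetical helper)."""
    escaped = escape(text)  # handles &, < and > only
    return escaped.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


sample = "R&D \u2014 caf\u00e9"  # contains an em dash and an e-acute
encoded = to_ascii_markup(sample)
print(encoded)  # R&amp;D &#8212; caf&#233;
print(html.unescape(encoded) == sample)  # True
```

The output contains no named entities beyond the predefined &amp;, so it should parse the same way as text/html and as application/xhtml+xml.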

answered Oct 13 '22 00:10

bobince

I needed to validate, as XML, content that might be HTML5. HTML 4 and XHTML defined only about 250 entities, while the current draft (January 2012) has more than 2000.

GET 'http://www.w3.org/TR/html5-author/named-character-references.html' |
xmllint --html --xmlout --format --noent - |
grep -E '<code|<span.*glyph' |  # get only the bits we're interested in
sed -e 's/.*">/__/' | # Add some "__" markers to make e.g. whitespace
sed -e 's/<.*/__/' |  #  entities work with xargs
sed 's/"/\&quot;/' | # xmllint output contains " which messes up xargs
sed "s/'/\&apos;/" | # ditto apostrophes. Make them HTML entities instead.
xargs -n 2 echo |  # Put the entity names and values on one line
sed 's/__/<!ENTITY /' | # Make a DTD
sed 's/;__/ /' |
sed 's/ __/"/'  |
sed 's/__$/">/' |
grep -E -v '\bapos\b|\bquot\b|\blt\b|\bgt\b|\bamp\b' # remove XML entities.

You end up with a file containing 2114 entities.

<!ENTITY AElig "&#xC6;">
<!ENTITY Aacute "&#xC1;">
<!ENTITY Abreve "&#x102;">
<!ENTITY Acirc "&#xC2;">
<!ENTITY Acy "&#x410;">
<!ENTITY Afr "&#x1D504;">

Plugging this DTD into an XML parser should allow it to resolve these character entities.

Update October 2012: Since the working draft now has a JSON file (yes, I'm still using regular expressions) I worked it down to a single sed:

curl -s 'http://www.w3.org/TR/html5-author/entities.json' |
sed -n '/^  "&/s/"&\([^;"]*\)[^0-9]*\[\([0-9]*\)\].*/<!ENTITY \1 "\&#\2;">/p' |
uniq

Of course a JavaScript equivalent would be a lot more robust, but not everyone has node installed. Everyone has sed, right? Random sample output:

<!ENTITY subsetneqq "&#10955;">
<!ENTITY subsim "&#10951;">
<!ENTITY subsub "&#10965;">
<!ENTITY subsup "&#10963;">
<!ENTITY succapprox "&#10936;">
<!ENTITY succ "&#8827;">
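
For a variant that needs neither network access nor sed, here is a rough Python sketch. It assumes that Python's built-in html.entities.html5 table is an acceptable stand-in for the W3C list (the two may differ slightly between spec versions), and the names char_ref and entity_declarations are made up for this example. Unlike the sed version, it also handles names that expand to two code points, and it double-escapes & and < so that names such as AMP stay usable.

```python
from html.entities import html5
import xml.etree.ElementTree as ET

PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}  # built into XML already


def char_ref(ch: str) -> str:
    """Numeric reference for one character, safe inside an entity value."""
    ref = f"&#x{ord(ch):X};"
    # "&" and "<" must be double-escaped, as the XML spec does for amp and lt,
    # or the replacement text would be re-parsed as markup.
    return ref.replace("&", "&#38;", 1) if ch in "&<" else ref


def entity_declarations():
    """Return one <!ENTITY ...> line per HTML5 name that ends in ';'."""
    lines = []
    for name, chars in sorted(html5.items()):
        if not name.endswith(";") or name[:-1] in PREDEFINED:
            continue
        value = "".join(char_ref(ch) for ch in chars)
        lines.append(f'<!ENTITY {name[:-1]} "{value}">')
    return lines


declarations = entity_declarations()

# Quick check: put the declarations in an internal subset and parse a document.
doc = (
    "<!DOCTYPE html [\n" + "\n".join(declarations) + "\n]>\n"
    '<html xmlns="http://www.w3.org/1999/xhtml">&mdash;&nbsp;&AMP;</html>'
)
text = ET.fromstring(doc).text
print(len(declarations), repr(text))
```

Writing "\n".join(declarations) to a file gives a DTD fragment in the same shape as the output above, with hexadecimal instead of decimal references.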

answered Oct 13 '22 02:10

mogsie

The right answer (the modern way)

I asked this question five years ago. Now every browser supports UTF-8, and UTF-8 can encode every character that the named character entities stand for. The most correct current solution to this problem is not to use named entities at all, but to serve only (strict) UTF-8 and to use the actual characters in it.

This is a list of all XML entities. All of these have UTF-8 character alternatives -- and that's how they'd normally be rendered anyway.

For instance, take

U+1D6D8, MATHEMATICAL BOLD SMALL CHI            , b.chi

I suppose in some variant of XML you could have &b.chi; or something. Searching for MATHEMATICAL BOLD SMALL CHI, you'll find a page on fileformat.info that lists the 𝛘 character.

Alternatively, in Windows you can type Alt + 1 D 6 D 8 (the 1D6D8 comes from the table of XML entities), or in Linux Ctrl + Shift + u 1 D 6 D 8.

This will put the character into your document the right way.
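
If you would rather check a code point programmatically than search for it, a small Python sketch (standard library only; the variable names are just for illustration) can look up the character's name and show the bytes that end up in a UTF-8 file:

```python
import html
import unicodedata

chi = "\U0001D6D8"  # the code point from the entity table above

name = unicodedata.name(chi)
utf8_bytes = chi.encode("utf-8")

print(name)              # MATHEMATICAL BOLD SMALL CHI
print(utf8_bytes.hex())  # f09d9b98

# The numeric character reference decodes to the same character.
print(html.unescape("&#x1D6D8;") == chi)  # True
```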

answered Oct 13 '22 01:10

NO WAR WITH RUSSIA

Tags: html, html-entities
