
DOMDocument::loadXML vs. HTML Entities

I currently have a problem reading in XHTML, as the XML parser doesn't recognise HTML character entities. For example:

<?php
$text = <<<EOF
<?xml version="1.0" encoding="utf-8" standalone="no"?>
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>Entities are Causing Me Problems</title>
  </head>
  <body>
    <p>Copyright &copy; 2010 Some Bloke</p>
  </body>
</html>
EOF;

$imp = new DOMImplementation ();
$html5 = $imp->createDocumentType ('html', '', '');
$doc = $imp->createDocument ('http://www.w3.org/1999/xhtml', 'html', $html5);

$doc->loadXML ($text);

header ('Content-Type: application/xhtml+xml; charset=utf-8');
echo $doc->saveXML ();

Results in:

Warning: DOMDocument::loadXML() [domdocument.loadxml]: Entity 'copy' not defined in Entity, line: 8 in testing.php on line 19

How can I fix this while allowing myself to serve pages as XHTML5?

casr asked Feb 14 '10 17:02


2 Answers

XHTML5 does not have a DTD, so you may not use the old-school HTML named entities in it: there is no document type definition to tell the parser what the named entities are for this language. (The exceptions are the predefined XML entities &lt;, &gt;, &amp;, &quot; and &apos;, though you generally don't want to use the last one.)

Instead use a numeric character reference (&#169;) or, better, just a plain unencoded © character (in UTF-8; remember to include the <meta> element to signify the character set to non-XML parsers).
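
As a rough illustration of both suggestions, here is a minimal sketch that feeds the question's document to DOMDocument::loadXML() twice, once with the numeric character reference and once with the literal UTF-8 character. The $template and $results names and the changed <title> are made up for this example, and the <meta> line is only there for non-XML parsers, as noted above.

```php
<?php
// Sketch only: the question's document with the copyright sign left as a
// placeholder (%s), so both suggested spellings can be tried in turn.
$template = <<<EOF
<?xml version="1.0" encoding="utf-8" standalone="no"?>
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta charset="utf-8" />
    <title>Entities Without a DTD</title>
  </head>
  <body>
    <p>Copyright %s 2010 Some Bloke</p>
  </body>
</html>
EOF;

$results = [];
// First the numeric character reference, then the raw UTF-8 bytes for U+00A9.
foreach (['&#169;', "\xC2\xA9"] as $symbol) {
    $doc = new DOMDocument();
    // loadXML() returns false (and raises a warning) if the XML is not well formed.
    if ($doc->loadXML(sprintf($template, $symbol))) {
        $results[] = $doc->getElementsByTagName('p')->item(0)->textContent;
    }
}

foreach ($results as $line) {
    echo $line, "\n";
}
```

Both variants should load without the "Entity 'copy' not defined" warning, because neither depends on a DTD to define the entity.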

bobince answered Sep 25 '22 09:09


Try using DOMDocument::loadHTML() instead. It doesn't choke on imperfect markup.
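
A minimal sketch of that approach, assuming a cut-down version of the question's markup (the $html and $text names are made up for this example). The HTML parser has the named entities built in, so &copy; is resolved without a DTD:

```php
<?php
// Cut-down HTML version of the question's document, still using &copy;.
$html = '<!DOCTYPE html><html><head><title>Entities</title></head>'
      . '<body><p>Copyright &copy; 2010 Some Bloke</p></body></html>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// DOM strings are UTF-8, so the entity comes back as the real character.
$text = $doc->getElementsByTagName('p')->item(0)->textContent;
echo $text, "\n";
```

One caveat, as far as I know: loadHTML() uses libxml2's HTML parser, which builds the tree without the XHTML namespace and may warn about newer HTML5 elements. If the goal is to serve XHTML5, check the serialised output before relying on this.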

Xorlev answered Sep 22 '22 09:09