
How to make HTML5 work with DOMDocument?


I'm attempting to parse HTML code with DOMDocument, make some changes to it, then assemble it back into a string that I send to the output.

But there are a few issues with parsing: what I send to DOMDocument does not always come back in the same form :)

Here's a list:

  1. using ->loadHTML:

    • formats my document regardless of the preserveWhiteSpace and formatOutput settings (losing whitespace in preformatted text)
    • gives me errors when I have HTML5 tags like <header>, <footer> etc. But they can be suppressed, so I can live with this.
    • produces inconsistent markup - for example, if I add a <link ... /> element (with a self-closing tag), after parsing/saveHTML the output will be <link ... >
  2. using ->loadXML:

    • escapes characters like > inside <style> or <script> tags: body > div becomes body &gt; div
    • all empty tags are closed the same way, for example <meta ... /> becomes <meta ...></meta>; but this can be fixed with a regex.
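
The "errors can be suppressed" point under loadHTML can be sketched as follows. This is a minimal example, assuming the standard libxml error-handling functions; the sample markup and variable names are made up for illustration.

```php
<?php
// Illustrative HTML5 input (not from the original question).
$html = '<!DOCTYPE html><html><body><header>Title</header><footer>End</footer></body></html>';

// Buffer libxml errors internally instead of raising PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

// The buffered errors may contain complaints about unknown tags such as <header>,
// depending on the libxml2 version in use.
$errors = libxml_get_errors();
libxml_clear_errors();

// Restore whatever setting was active before.
libxml_use_internal_errors($previous);

// The HTML5 elements are still present in the resulting tree.
$headerCount = $dom->getElementsByTagName('header')->length;
echo $headerCount, "\n";
```

This only hides the parser complaints; it does not change how loadHTML/saveHTML normalise whitespace or tag closing.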

I haven't tried html5lib, but for performance reasons I'd prefer DOMDocument over a custom parser.


So, as the Honeymonster mentioned, using CDATA fixes the main problem with loadXML.
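
One possible way to apply the CDATA idea is to move the text of <style> and <script> elements into CDATA sections after loading, so saveXML no longer escapes it. This is only a sketch: the sample document is invented, and it assumes the CSS/JS content never contains the sequence `]]>`.

```php
<?php
// Illustrative input (not from the original question).
$xml = '<html><head><style>body > div { color: red; }</style></head><body><div/></body></html>';

$dom = new DOMDocument();
$dom->loadXML($xml);

foreach (['style', 'script'] as $tag) {
    // Work on a snapshot array rather than the live node list.
    foreach (iterator_to_array($dom->getElementsByTagName($tag)) as $element) {
        $raw = $element->textContent;
        if ($raw === '') {
            continue;
        }
        // Drop the existing text children...
        while ($element->firstChild !== null) {
            $element->removeChild($element->firstChild);
        }
        // ...and store the same text in a CDATA section, which is serialized unescaped.
        $element->appendChild($dom->createCDATASection($raw));
    }
}

$out = $dom->saveXML($dom->documentElement);
echo $out, "\n";
```

The CDATA markers remain in the output, so they may need to be commented out or stripped if the result is served as HTML rather than XHTML.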

Is there any way I could prevent self closing of all empty HTML tags besides a certain set, without using regex?

Right now I have:

$html = $dom->saveXML($node);

$html = preg_replace_callback('#<(\w+)([^>]*)\s*/>#s', function($matches){

    // ignore only these tags
    $xhtml_tags = array('br', 'hr', 'input', 'frame', 'img', 'area', 'link', 'col', 'base', 'basefont', 'param', 'meta');

    // if an element that is not in the above list is empty,
    // it should close like `<element></element>` (e.g. an empty `<title>`)
    return in_array($matches[1], $xhtml_tags) ? "<{$matches[1]}{$matches[2]} />" : "<{$matches[1]}{$matches[2]}></{$matches[1]}>";

}, $html);

which works but it will also do the replacements in the CDATA content, which I don't want...
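
A regex-free alternative is to make the change in the DOM itself before serializing, which leaves CDATA content alone because only element nodes are touched. The sketch below relies on a commonly used trick: giving an empty element an empty text node as a child, so the serializer writes an explicit closing tag. The sample document and the `$voidTags` name are illustrative; the tag list is the one from the snippet above.

```php
<?php
// Elements that may stay self-closing (same list as in the question).
$voidTags = ['br', 'hr', 'input', 'frame', 'img', 'area', 'link', 'col', 'base', 'basefont', 'param', 'meta'];

// Illustrative input (not from the original question).
$dom = new DOMDocument();
$dom->loadXML('<html><head><title/><meta charset="utf-8"/></head><body><div/><br/></body></html>');

$xpath = new DOMXPath($dom);

// Select every element that has no child nodes at all.
foreach ($xpath->query('//*[not(node())]') as $element) {
    if (!in_array(strtolower($element->nodeName), $voidTags, true)) {
        // An empty text child makes saveXML emit <tag></tag> instead of <tag/>.
        $element->appendChild($dom->createTextNode(''));
    }
}

$html = $dom->saveXML($dom->documentElement);
echo $html, "\n";
```

The built-in LIBXML_NOEMPTYTAG option of saveXML also expands empty tags, but it applies to every element, including those in the list, so it does not fit the "all except a certain set" requirement on its own.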

Alex asked May 23 '12 01:05


1 Answer

Use html5lib. It can parse HTML5 and produce a DOMDocument. Example:

require_once '/path/to/HTML5/Parser.php';
$dom = HTML5_Parser::parse('<html><body>...');


Francis Avila answered Oct 13 '22 02:10