
How to make HTML5 work with DOMDocument?


I'm attempting to parse HTML code with DOMDocument, make some changes to it, then assemble it back into a string that I send to the output.

But there are a few parsing issues, meaning that what I send to DOMDocument does not always come back in the same form :)

Here's a list:

  1. using ->loadHTML:

    • formats my document regardless of the preserveWhitespace and formatOutput settings (losing whitespace in preformatted text)
    • gives me errors when I have HTML5 tags like <header>, <footer>, etc. But they can be suppressed, so I can live with this.
    • produces inconsistent markup: for example, if I add a <link ... /> element (with a self-closing tag), after parsing/saveHTML the output will be <link .. >
  2. using ->loadXML:

    • encodes characters like > inside <style> or <script> tags as entities: body > div becomes body &gt; div
    • all empty tags are closed the same way, for example <meta ... /> becomes <meta...></meta>; but this can be fixed with a regex.
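
The error suppression mentioned for loadHTML can be sketched as follows. This is a minimal example, assuming a made-up input string; it uses libxml_use_internal_errors() so that parser messages about unknown tags are collected instead of being raised as PHP warnings. Whether any messages are produced at all depends on the installed libxml version.

```php
<?php
// Hypothetical input with HTML5 elements that older libxml HTML parsers do not know.
$html = '<!DOCTYPE html><html><body><header>Title</header><footer>End</footer></body></html>';

// Collect libxml messages internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

// The collected messages remain available, e.g. for logging.
$errors = libxml_get_errors();
libxml_clear_errors();

// Restore the previous error handling mode.
libxml_use_internal_errors($previous);

$output = $dom->saveHTML();
echo $output;
```

The unknown elements are kept in the tree, so <header> and <footer> still appear in the serialized output.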

I haven't tried html5lib, but I'd prefer DOMDocument over a custom parser for performance reasons.


Update:

So, as the Honeymonster mentioned, using CDATA fixes the main problem with loadXML.
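
A small sketch of the CDATA fix, using made-up markup: the same <style> content is loaded once as plain text and once wrapped in a CDATA section, and both are serialized with saveXML for comparison.

```php
<?php
// Hypothetical fragments: the selector contains a literal ">".
$plainXml = '<html><head><style>body > div { margin: 0; }</style></head></html>';
$cdataXml = '<html><head><style><![CDATA[body > div { margin: 0; }]]></style></head></html>';

$dom = new DOMDocument();

// Plain text content: the ">" is escaped on output.
$dom->loadXML($plainXml);
$plain = $dom->saveXML($dom->documentElement);

// CDATA content: the section is written back verbatim.
$dom->loadXML($cdataXml);
$withCdata = $dom->saveXML($dom->documentElement);

echo $plain, "\n", $withCdata, "\n";
```

Note that the CDATA markers themselves are part of the output, so they may need to be hidden inside CSS or JavaScript comments if the result is served as HTML.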

Is there any way I could prevent self closing of all empty HTML tags besides a certain set, without using regex?

Right now I have:

    $html = $dom->saveXML($node);

    $html = preg_replace_callback('#<(\w+)([^>]*)\s*/>#s', function($matches){

        // ignore only these tags
        $xhtml_tags = array('br', 'hr', 'input', 'frame', 'img', 'area', 'link', 'col', 'base', 'basefont', 'param', 'meta');

        // if an element that is not in the above list is empty,
        // it should close like `<element></element>` (e.g. an empty `<title>`)
        return in_array($matches[1], $xhtml_tags)
            ? "<{$matches[1]}{$matches[2]} />"
            : "<{$matches[1]}{$matches[2]}></{$matches[1]}>";
    }, $html);

which works, but it will also do the replacements in the CDATA content, which I don't want...
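
One possible regex-free approach is to change the tree before serializing rather than the string afterwards. The sketch below is not from the original thread: it assumes that libxml writes an element with an empty text child as an open/close pair, and it uses made-up markup. Because it works on nodes, CDATA content is never touched.

```php
<?php
// Elements that may stay self-closed (same list as in the regex version).
$voidTags = array('br', 'hr', 'input', 'frame', 'img', 'area', 'link', 'col', 'base', 'basefont', 'param', 'meta');

// Hypothetical document; the CDATA section contains something that looks like an empty tag.
$xml = '<html><head><title/><meta charset="utf-8"/>'
     . '<script><![CDATA[var s = "<b/>";]]></script></head>'
     . '<body><div/><br/></body></html>';

$dom = new DOMDocument();
$dom->loadXML($xml);

$xpath = new DOMXPath($dom);

// Select every element that has no child nodes at all.
foreach ($xpath->query('//*[not(node())]') as $element) {
    if (!in_array(strtolower($element->nodeName), $voidTags, true)) {
        // Assumption: an empty text child makes the serializer emit <tag></tag>.
        $element->appendChild($dom->createTextNode(''));
    }
}

$output = $dom->saveXML($dom->documentElement);
echo $output;
```

If every empty element should be expanded, passing LIBXML_NOEMPTYTAG as the second argument to saveXML is a simpler alternative, but it would also turn <br/> into <br></br>.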

Alex asked May 23 '12 01:05


1 Answer

Use html5lib. It can parse HTML5 and produce a DOMDocument. Example:

    require_once '/path/to/HTML5/Parser.php';
    $dom = HTML5_Parser::parse('<html><body>...');

Documentation

Francis Avila answered Oct 13 '22 02:10