loadHTML LIBXML_HTML_NOIMPLIED on an html fragment generates incorrect tags

Question

Using the LIBXML_HTML_NOIMPLIED flag with an html fragment generates incorrect tags:

$str = '<p>Lorem ipsum dolor sit amet.</p><p>Nunc vel vehicula ante.</p>';
$doc = new DOMDocument();
$doc->loadHTML($str, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
echo $doc->saveHTML();

Outputs:

<p>Lorem ipsum dolor sit amet.<p>Nunc vel vehicula ante.</p></p>

I have found hacks to work around this using regexes, but that defeats the purpose of using DOM. I have tested this with several versions of libxml and php, the latest with libxml 2.9.2, php 5.6.7 (Debian Jessy). Any suggestions appreciated.

hakre · Accepted Answer

The re-arrangement is done by the LIBXML_HTML_NOIMPLIED option you're using. Looks like it's not stable enough for your case.

Also you might want to not use it for portablility reasons, for example I've got one PHP 5.4.36 with Libxml 2.7.8 at hand that is not supporting LIBXML_HTML_NOIMPLIED (Libxml >= 2.7.7) but later LIBXML_HTML_NODEFDTD (Libxml >= 2.7.8) option.

I know this way of dealing with it. When you load the fragment, you wrap it into a <div> element:

$doc->loadHTML("<div>$str</div>");

This helps to guide DOMDocument on the structure you want.

You can then extract this container from the document itself:

$container = $doc->getElementsByTagName('div')->item(0);
$container = $container->parentNode->removeChild($container);

And then remove all children from the document:

while ($doc->firstChild) {
    $doc->removeChild($doc->firstChild);
}

Now the document is completely empty and you're now able to append children again. Luckily there is the <div> container element we removed earlier, so we can add from it:

while ($container->firstChild ) {
    $doc->appendChild($container->firstChild);
}

The fragment then can be retrieved with the known saveHTML method:

echo $doc->saveHTML();

Which gives in your scenario:

<p>Lorem ipsum dolor sit amet.</p><p>Nunc vel vehicula ante.</p>

This methodology is a little different from the existing material here on site (see the references I give below), so the example at once:

$str = '<p>Lorem ipsum dolor sit amet.</p><p>Nunc vel vehicula ante.</p>';

$doc = new DOMDocument();
$doc->loadHTML("<div>$str</div>");

$container = $doc->getElementsByTagName('div')->item(0);
$container = $container->parentNode->removeChild($container);
while ($doc->firstChild) {
    $doc->removeChild($doc->firstChild);
}

while ($container->firstChild ) {
    $doc->appendChild($container->firstChild);
}

echo $doc->saveHTML();

I also really recommend the reference question on How to saveHTML of DOMDocument without HTML wrapper? for a further read as well as the one about inner-html

References

How to saveHTML of DOMDocument without HTML wrapper?
How to get innerHTML of DOMNode?

Nicholas Shanks · Answer

The LIBXML_HTML_NOIMPLIED option is not buggy, it's just badly documented. To fix the problem, wrap your input string with <html>…</html>, process your HTML, and then strip that off the output. LibXML requires a root node, and is treating the first element it finds as the root node, deleting the (incorrectly located) closing tag it finds half-way through, and then outputting the closing tag of the first element it found at the end of the document. It's logical when you see it from (Lib)XML's perspective.

loadHTML LIBXML_HTML_NOIMPLIED on an html fragment generates incorrect tags

Tags:

html

php

domdocument

J.L. Hill

2 Answers

References

hakre

Nicholas Shanks

Recent Activity

Donate For Us

loadHTML LIBXML_HTML_NOIMPLIED on an html fragment generates incorrect tags

Tags:

html

php

domdocument

J.L. Hill

2 Answers

References

hakre

Nicholas Shanks

Related questions

Recent Activity

Donate For Us