All of these answers are now wrong, because as of PHP 5.4 and Libxml 2.6, loadHTML has an $options parameter which instructs Libxml about how it should parse the content.
Therefore, if we load the HTML with these options:
$html->loadHTML($content, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
when doing saveHTML() there will be no doctype, no <html>, and no <body>.
LIBXML_HTML_NOIMPLIED turns off the automatic adding of implied html/body elements.
LIBXML_HTML_NODEFDTD prevents a default doctype being added when one is not found.
Full documentation about the Libxml parameters is in the PHP manual, under the libxml predefined constants.
(Note that the loadHTML docs say that Libxml 2.6 is needed, but LIBXML_HTML_NODEFDTD is only available in Libxml 2.7.8 and LIBXML_HTML_NOIMPLIED is available in Libxml 2.7.7.)
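The version caveat can be checked at runtime. This is a minimal sketch, assuming the DOM extension is loaded; the helper name `load_fragment` and the sample input are hypothetical, and `LIBXML_VERSION` is PHP's integer form of the linked Libxml version (20708 corresponds to 2.7.8).

```php
<?php
// Hypothetical helper: load an HTML snippet without the implied wrappers
// when the linked Libxml is new enough to support both flags.
function load_fragment(string $content): DOMDocument
{
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);

    if (LIBXML_VERSION >= 20708) {
        // 2.7.8 covers LIBXML_HTML_NOIMPLIED (2.7.7) and LIBXML_HTML_NODEFDTD (2.7.8).
        $doc->loadHTML($content, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    } else {
        // Older Libxml: the flags do not exist, so the wrappers will be added.
        $doc->loadHTML($content);
    }

    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $doc;
}

$doc = load_fragment('<div><p>Hello</p></div>'); // sample input
$out = $doc->saveHTML();
echo $out;
```

On an older Libxml the fallback branch still produces the doctype and the html/body wrappers, so the caller would need one of the other approaches in this thread.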
Just remove the nodes directly after loading the document with loadHTML():
# remove <!DOCTYPE
$doc->removeChild($doc->doctype);
# remove <html><body></body></html>
$doc->replaceChild($doc->firstChild->firstChild->firstChild, $doc->firstChild);
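Note that the replaceChild() call above keeps only the first child of <body>. If the body holds several top-level nodes, one option is to serialize each child on its own. This is a sketch rather than part of the original answer; it assumes PHP 5.3.6 or later, where saveHTML() accepts a node argument, and the `$html` sample is made up.

```php
<?php
$html = '<p>One</p><p>Two</p>'; // sample input, not from the original answer

$doc = new DOMDocument();
$doc->loadHTML($html);

// loadHTML() wraps the content in <html><body>; grab the body element.
$body = $doc->getElementsByTagName('body')->item(0);

$inner = '';
if ($body !== null) {
    foreach ($body->childNodes as $child) {
        // Serialize each child node individually, leaving the wrappers out.
        $inner .= $doc->saveHTML($child);
    }
}
echo $inner;
```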
The issue with the top answer is that LIBXML_HTML_NOIMPLIED is unstable.
It can reorder elements (in particular, moving the top element's closing tag to the bottom of the document), add random p tags, and perhaps cause a variety of other issues[1]. It may remove the html and body tags for you, but at the cost of unstable behavior. In production, that's a red flag. In short:
Don't use LIBXML_HTML_NOIMPLIED. Instead, use substr.
Think about it. The lengths of <html><body> and </body></html> are fixed, and they sit at both ends of the document: their sizes never change, and neither do their positions. This allows us to use substr to cut them away:
$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NODEFDTD);
echo substr($dom->saveHTML(), 12, -15); // the star of this operation
(THIS IS NOT THE FINAL SOLUTION HOWEVER! See below for the complete answer, keep reading for context)
We cut 12 away from the start of the document because <html><body> = 12 characters (<<>>+html+body = 4+4+4), and we go backwards and cut 15 off the end because \n</body></html> = 15 characters (\n+//+<<>>+body+html = 1+2+4+4+4).
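The character counts above can be checked directly with strlen(). The `$wrapped` string below is hand-written to match the shape described in this answer; it is an illustration, not actual parser output.

```php
<?php
// Lengths of the wrappers that substr() trims away.
$front = strlen('<html><body>');     // 12
$back  = strlen("\n</body></html>"); // 15

echo $front, ' ', $back, "\n";

// The same offsets applied to a hand-written string with those wrappers.
$wrapped = '<html><body><p>Hi</p>' . "\n</body></html>";
echo substr($wrapped, $front, -$back), "\n"; // prints "<p>Hi</p>"
```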
Notice that I still use LIBXML_HTML_NODEFDTD to keep the !DOCTYPE from being included. First, this simplifies the substr removal of the HTML/BODY tags. Second, we don't remove the doctype with substr because we don't know if the 'default doctype' will always be something of a fixed length. But, most importantly, LIBXML_HTML_NODEFDTD stops the DOM parser from applying a non-HTML5 doctype to the document, which at least prevents the parser from treating elements it doesn't recognize as loose text.
We know for a fact that the HTML/BODY tags are of fixed lengths and positions, and we know that constants like LIBXML_HTML_NODEFDTD are never removed without some type of deprecation notice, so the above method should roll well into the future, BUT...
...the only caveat is that the DOM implementation could change the way in which the HTML/BODY tags are placed within the document, for instance by removing the newline at the end of the document, adding spaces between the tags, or adding newlines.
This can be remedied by searching for the positions of the opening and closing body tags, and using those offsets as the lengths to trim off. We use strpos and strrpos to find the offsets from the front and back, respectively:
$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NODEFDTD);
$output = $dom->saveHTML(); // serialize once and reuse the string
$trim_off_front = strpos($output, '<body>') + 6;
// PositionOf<body> + 6 = Cutoff offset after '<body>'
// 6 = Length of '<body>'
$trim_off_end = strrpos($output, '</body>') - strlen($output);
// ^ PositionOf</body> - LengthOfDocument = Relative-negative cutoff offset before '</body>'
echo substr($output, $trim_off_front, $trim_off_end);
In closing, a repeat of the final, future-proof answer:
$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NODEFDTD);
$output = $dom->saveHTML(); // serialize once and reuse the string
$trim_off_front = strpos($output, '<body>') + 6;
$trim_off_end = strrpos($output, '</body>') - strlen($output);
echo substr($output, $trim_off_front, $trim_off_end);
No doctype, no html tag, no body tag. We can only hope the DOM parser will receive a fresh coat of paint soon and we can more directly eliminate these unwanted tags.
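For reuse, the same strpos/strrpos trimming can be wrapped in a function. This is a sketch under stated assumptions: the name `body_inner_html` is hypothetical, the guard for missing tags is an addition that is not in the answer above, and parser warnings are silenced with libxml_use_internal_errors().

```php
<?php
// Hypothetical helper wrapping the strpos()/strrpos() approach.
function body_inner_html(string $html): string
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // keep parser warnings quiet
    $dom->loadHTML($html, LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $out   = (string) $dom->saveHTML(); // serialize once and reuse
    $start = strpos($out, '<body>');
    $end   = strrpos($out, '</body>');

    if ($start === false || $end === false) {
        // Assumption: if the wrappers are not found, return the output untouched.
        return $out;
    }

    $start += strlen('<body>');
    return substr($out, $start, $end - $start);
}

$result = body_inner_html('<p>Hello</p><p>World</p>'); // sample input
echo $result;
```

The search is for the literal string '<body>', so input that carries its own body tag with attributes would fall through to the guard and come back untrimmed.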
Use saveXML() instead, and pass the documentElement as an argument to it.
$innerHTML = '';
foreach ($document->getElementsByTagName('p')->item(0)->childNodes as $child) {
    $innerHTML .= $document->saveXML($child);
}
echo $innerHTML;
http://php.net/domdocument.savexml
Use DOMDocumentFragment:
$html = 'what you want';
$doc = new DomDocument();
$fragment = $doc->createDocumentFragment();
$fragment->appendXML($html);
$doc->appendChild($fragment);
echo $doc->saveHTML();