PHP DOMDocument - get html source of BODY

I'm using PHP's DOMDocument to parse and normalize user-submitted HTML. The loadHTML method parses the content, and saveHTML then returns a well-formed result:

$dom = new DOMDocument();
$dom->loadHTML('<div><p>Hello World');
$well_formed = $dom->saveHTML();
echo $well_formed;

This does a beautiful job of parsing the fragment and adding the appropriate closing tags. The problem is that I'm also getting a bunch of tags I don't want such as <!DOCTYPE>, <html>, <head> and <body>. I understand that every well-formed HTML document needs these tags, but the HTML fragment I'm normalizing is going to be inserted into an existing valid document.

leepowers asked Feb 27 '10 00:02


2 Answers

The quick solution to your problem is to use an XPath expression to grab the body.

$dom = new DOMDocument();
$dom->loadHTML('<div><p>Hello World');
$xpath = new DOMXPath($dom);
$body = $xpath->query('/html/body');
echo $dom->saveXML($body->item(0));
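Note that the snippet above serializes the body element itself, so the output still contains the <body> wrapper, and saveXML uses XML serialization rather than HTML. If only the contents of the body are wanted, one option is to serialize each child node separately. The following is a minimal sketch, not part of the original answer: the helper name bodyInnerHtml is hypothetical, and it assumes PHP 5.3.6 or later, where saveHTML accepts a node argument.

```php
<?php
// Hypothetical helper (not from the original answer): returns the markup
// found inside <body>, without the <body> wrapper itself.
function bodyInnerHtml(string $fragment): string
{
    if ($fragment === '') {
        return ''; // loadHTML() rejects an empty string
    }

    $dom = new DOMDocument();
    $dom->loadHTML($fragment);

    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return ''; // the parser produced no body element
    }

    $html = '';
    foreach ($body->childNodes as $child) {
        // Serialize each child as HTML; requires PHP >= 5.3.6.
        $html .= $dom->saveHTML($child);
    }
    return $html;
}

echo bodyInnerHtml('<div><p>Hello World');
```

For the fragment in the question, this should give the div and p with their closing tags added, and no doctype, html, head, or body tags around them.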

A word of warning here. Sometimes loadHTML will throw a warning when it encounters certain poorly formed HTML documents. If you're parsing those kinds of HTML documents, you'll need to find a better HTML parser [self link warning].
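If the warnings are the only problem and the parsed result is otherwise acceptable, libxml can be told to collect parse errors internally instead of raising PHP warnings. Here is a rough sketch of that approach. The helper name loadHtmlQuietly and the sample input are assumptions for illustration, and which errors get reported depends on the libxml version in use.

```php
<?php
// Hypothetical helper: parses HTML while collecting libxml's parse
// errors instead of letting them surface as PHP warnings.
function loadHtmlQuietly(string $html, array &$errors = []): DOMDocument
{
    // Switch to internal error handling and remember the old setting.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML($html);

    // Hand the collected LibXMLError objects back to the caller,
    // then clear the buffer and restore the previous setting.
    $errors = libxml_get_errors();
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $dom;
}

$dom = loadHtmlQuietly('<div><p>Hello <bogus>World</div>', $errors);
foreach ($errors as $error) {
    echo trim($error->message), ' (line ', $error->line, ")\n";
}
```

Restoring the previous setting keeps the helper from changing error handling for the rest of the script.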

Alan Storm answered Sep 23 '22 17:09


In your case, you do not want to work with an HTML document but with an HTML fragment, a portion of HTML code, which means DOMDocument is not quite what you need.

Instead, I would use something like HTMLPurifier (quoting):

HTML Purifier is a standards-compliant HTML filter library written in PHP. HTML Purifier will not only remove all malicious code (better known as XSS) with a thoroughly audited, secure yet permissive whitelist, it will also make sure your documents are standards compliant, something only achievable with a comprehensive knowledge of W3C's specifications.

And if you try your portion of code:

<div><p>Hello World

using the demo page of HTMLPurifier, you get this clean HTML as output:

<div><p>Hello World</p></div>

Much better, isn't it? ;-)

(Note that HTMLPurifier supports a wide range of options; taking a look at its documentation might not hurt.)

Pascal MARTIN answered Sep 19 '22 17:09