
PHP DOMDocument adds extra tags

I'm trying to parse a document, find all of its image tags, and change each one's src attribute to something different.

$domDocument = new DOMDocument();

$domDocument->loadHTML($text);

$imageNodeList = $domDocument->getElementsByTagName('img');

foreach ($imageNodeList as $Image) {
  $Image->setAttribute('src', 'lalala');
  $domDocument->saveHTML($Image);
}

$text = $domDocument->saveHTML();

The $text initially looks like this:

<p>Hi, this is a test, here is an image<img src="http://example.com/beer.jpg" width="60" height="95" /> Because I like Beer!</p>

and this is the output $text:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p>Hi, this is a test, here is an image<img src="lalala" width="60" height="95"> Because I like Beer!</p></body></html>

I'm getting extra markup that I don't need: the html and body tags and the DOCTYPE declaration at the top. Is there a way to set up DOMDocument so that it doesn't add them?

asked Jan 26 '11 00:01 by Onema


1 Answer

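Pass two libxml flags as the second argument to loadHTML(). LIBXML_HTML_NOIMPLIED stops the parser from adding the implied html and body elements, and LIBXML_HTML_NODEFDTD stops it from adding a default DOCTYPE when the input has none. Both need PHP 5.4 or later built against libxml 2.7.8 or later: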

$domDocument->loadHTML($text, LIBXML_HTML_NOIMPLIED|LIBXML_HTML_NODEFDTD);
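To see what each flag contributes on its own, the following minimal sketch serializes the same fragment under four option combinations. The helper name roundTrip and the shortened sample string are assumptions made for this illustration, not part of the original answer.

```php
<?php
// Shortened sample fragment (an assumption for this illustration).
$text = '<p>Hi <img src="http://example.com/beer.jpg"> there</p>';

// Hypothetical helper: parse $html with the given libxml option bitmask
// and return the whole document serialized back to a string.
function roundTrip(string $html, int $options): string
{
    $doc = new DOMDocument();
    $doc->loadHTML($html, $options);
    return $doc->saveHTML();
}

$noFlags   = roundTrip($text, 0);                       // default behaviour
$noDoctype = roundTrip($text, LIBXML_HTML_NODEFDTD);    // only suppress the DOCTYPE
$noWrapper = roundTrip($text, LIBXML_HTML_NOIMPLIED);   // only suppress html/body
$both      = roundTrip($text, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

// Print each variant so the differences can be compared side by side.
foreach (compact('noFlags', 'noDoctype', 'noWrapper', 'both') as $label => $html) {
    echo $label, ":\n", $html, "\n";
}
```

Comparing the four results should show that the DOCTYPE and the html/body wrapper are controlled independently, which is why both flags are combined with the bitwise OR operator.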

A complete demo:

$text = '<p>Hi, this is a test, here is an image<img src="http://example.com/beer.jpg" width="60" height="95" /> Because I like Beer!</p>';
$domDocument = new DOMDocument;
$domDocument->loadHTML($text, LIBXML_HTML_NOIMPLIED|LIBXML_HTML_NODEFDTD);
$imageNodeList = $domDocument->getElementsByTagName('img');

foreach ($imageNodeList as $Image) {
    // setAttribute() modifies the node in place, so no per-node saveHTML() call is needed
    $Image->setAttribute('src', 'lalala');
}

$text = $domDocument->saveHTML();
echo $text;

Output:

<p>Hi, this is a test, here is an image<img src="lalala" width="60" height="95"> Because I like Beer!</p>
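One caveat worth hedging: the demo above has a single top-level p element. With LIBXML_HTML_NOIMPLIED, libxml treats the first element as the document root, so fragments with several top-level elements are commonly reported to come back rearranged. A frequently used workaround, sketched here under that assumption, wraps the fragment in a temporary div and serializes only the div's children. The function name rewriteImageSources and the sample input are hypothetical and not part of the original answer.

```php
<?php
// Hypothetical helper (not from the original answer): set the src of every
// <img> in an HTML fragment that may contain several top-level elements.
function rewriteImageSources(string $fragment, string $newSrc): string
{
    $doc = new DOMDocument();

    // Collect libxml parse warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);

    // The temporary <div> gives the parser exactly one root element.
    $doc->loadHTML(
        '<div>' . $fragment . '</div>',
        LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
    );

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('img') as $image) {
        $image->setAttribute('src', $newSrc);
    }

    // Serialize the wrapper's children only, so the <div> is not in the result.
    $html = '';
    foreach ($doc->documentElement->childNodes as $child) {
        $html .= $doc->saveHTML($child);
    }

    return $html;
}

echo rewriteImageSources(
    '<p>One <img src="a.jpg"></p><p>Two <img src="b.jpg"></p>',
    'lalala'
);
```

Serializing child nodes one at a time keeps the wrapper out of the result without any string trimming, at the cost of one extra loop.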
answered Sep 18 '22 11:09 by Wiktor Stribiżew