
How do I prevent PHP's DOMDocument from encoding HTML entities?

I have a function that replaces the href attribute of every anchor in a string using PHP's DOMDocument. Here's a snippet:

$doc        = new DOMDocument('1.0', 'UTF-8');
$doc->loadHTML($text);
$anchors    = $doc->getElementsByTagName('a');

foreach($anchors as $a) {
    $a->setAttribute('href', 'http://google.com');
}

return $doc->saveHTML();

The problem is that loadHTML($text) wraps $text in doctype, html, body, etc. tags. I tried working around this by using the following instead of loadHTML():

$doc        = new DOMDocument('1.0', 'UTF-8');
$node       = $doc->createTextNode($text);
$doc->appendChild($node);
...

Unfortunately, this encodes all the entities (anchors included). Does anyone know how to turn this off? I've already looked through the docs thoroughly and tried hacking around it, but can't figure it out.

Thanks! :)

thesmart asked Apr 27 '09 05:04


1 Answer

XML has only a handful of predefined entities (&lt;, &gt;, &amp;, &quot;, &apos;). All of your HTML entities are defined elsewhere. When you use loadHTML(), these entity definitions are loaded automatically; with loadXML() (or no load call at all) they are not.
createTextNode() does exactly what its name suggests: everything you pass as the value is treated as text content, not as markup. That is, if you pass characters that have a special meaning in markup (<, >, ...), they are encoded so that a parser can distinguish the text from actual markup (&lt;, &gt;, ...).
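
To make the difference concrete, here is a small sketch (not taken from the question's code) that puts the same characters into a document twice: once through createTextNode(), and once through a DOMDocumentFragment parsed with appendXML(). The sample string and the variable names are made up for illustration.

```php
<?php
$doc = new DOMDocument('1.0', 'UTF-8');

// 1) createTextNode(): the argument is literal character data.
$p = $doc->createElement('p');
$p->appendChild($doc->createTextNode('<b>bold</b> & more'));
$escaped = $doc->saveXML($p);   // markup characters come out as &lt;, &gt;, &amp;

// 2) A fragment parsed from (well-formed) XML: the argument is markup.
$q = $doc->createElement('p');
$fragment = $doc->createDocumentFragment();
$fragment->appendXML('<b>bold</b> &amp; more');
$q->appendChild($fragment);
$parsed = $doc->saveXML($q);    // contains a real <b> element

echo $escaped, "\n", $parsed, "\n";
```

Note that appendXML() parses XML, so it only knows the predefined XML entities; something like &nbsp; would not be accepted there.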

Where does $text come from? Can't you do the replacement within the actual HTML document?
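
If $text really is only a fragment, one possible workaround (a sketch, not necessarily what the asker ended up using) is to let loadHTML() add its wrapper tags and then serialize only the nodes you care about. The function name replaceHrefs(), the wrapper <div>, and the example URLs below are assumptions for illustration; the sketch also assumes plain ASCII input, since loadHTML() needs extra care with other encodings.

```php
<?php
// Hypothetical helper: rewrite every anchor's href in an HTML fragment
// and return the fragment without doctype/html/body wrappers.
function replaceHrefs(string $html, string $newHref): string
{
    $doc = new DOMDocument('1.0', 'UTF-8');

    // Collect parser warnings instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    // Wrap the fragment in a known container so it can be found again.
    $doc->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('a') as $a) {
        $a->setAttribute('href', $newHref);
    }

    // The first <div> in document order is the wrapper added above.
    $wrapper = $doc->getElementsByTagName('div')->item(0);

    // Serialize only the wrapper's children, skipping html/body/doctype.
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo replaceHrefs('<p>See <a href="http://example.com">this</a>.</p>', 'http://google.com'), "\n";
```

Passing a node to saveHTML() requires PHP 5.3.6 or later; on older versions the same idea would need saveXML($child) instead.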

VolkerK answered Nov 04 '22 18:11