Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

DOMDocument and HTML entities

I'm trying to parse some HTML that includes some HTML entities, like ×

$str = '<a href="http://example.com/"> A &#215; B</a>';

$dom = new DomDocument;
$dom -> substituteEntities = false;
$dom ->loadHTML($str);

$link = $dom ->getElementsByTagName('a') -> item(0);
$fullname = $link -> nodeValue;
$href = $link -> getAttribute('href');

echo "
fullname: $fullname \n
href: $href\n";    

but DomDocument substitutes the text for for A × B.

Is there some way to keep it from taking the & for an HTML entity and make it just leave it alone? I tried to set substituteEntities to false but it doesn't do anything

like image 743
rafa Avatar asked Aug 28 '11 11:08

rafa


1 Answers

From the docs:

The DOM extension uses UTF-8 encoding.
Use utf8_encode() and utf8_decode() to work with texts in ISO-8859-1 encoding or Iconv for other encodings.

Assuming you're using latin-1 try:

<?php
header('Content-type:text/html;charset=iso-8859-1');


$str = utf8_encode('<a href="http://example.com/"> A &#215; B</a>');

$dom = new DOMDocument;


$dom -> substituteEntities = false;
$dom ->loadHTML($str);

$link = $dom ->getElementsByTagName('a') -> item(0);
$fullname = utf8_decode($link -> nodeValue);
$href = $link -> getAttribute('href');

echo "
fullname: $fullname \n
href: $href\n";    ?>
like image 122
Dr.Molle Avatar answered Sep 28 '22 12:09

Dr.Molle