I'm trying to parse some HTML that includes some HTML entities, like ×
$str = '<a href="http://example.com/"> A × B</a>';
$dom = new DomDocument;
$dom -> substituteEntities = false;
$dom ->loadHTML($str);
$link = $dom ->getElementsByTagName('a') -> item(0);
$fullname = $link -> nodeValue;
$href = $link -> getAttribute('href');
echo "
fullname: $fullname \n
href: $href\n";
but DomDocument substitutes the text for for A × B.
Is there some way to keep it from taking the & for an HTML entity and make it just leave it alone? I tried to set substituteEntities to false but it doesn't do anything
From the docs:
The DOM extension uses UTF-8 encoding.
Use utf8_encode() and utf8_decode() to work with texts in ISO-8859-1 encoding or Iconv for other encodings.
Assuming you're using latin-1 try:
<?php
header('Content-type:text/html;charset=iso-8859-1');
$str = utf8_encode('<a href="http://example.com/"> A × B</a>');
$dom = new DOMDocument;
$dom -> substituteEntities = false;
$dom ->loadHTML($str);
$link = $dom ->getElementsByTagName('a') -> item(0);
$fullname = utf8_decode($link -> nodeValue);
$href = $link -> getAttribute('href');
echo "
fullname: $fullname \n
href: $href\n"; ?>
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With