$dom = new DOMDocument('1.0', 'UTF-8');
$str = '<p>Hello®</p>';
var_dump(mb_detect_encoding($str));
$dom->loadHTML($str);
var_dump($dom->saveHTML());
Output:
string(5) "UTF-8"
string(158) "<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p>Hello&Acirc;&reg;</p></body></html>
"
Why did my Unicode ® get converted to &Acirc;&reg; and how do I stop this?
Am I going crazy today?
DOMDocument::getElementsByTagName() is a built-in PHP method that returns a new DOMNodeList containing all elements with the given local tag name.
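To make that description concrete, here is a minimal sketch of walking the returned DOMNodeList. The markup and variable names are made up for illustration and are not from the question.

```php
<?php
// Hypothetical sample markup, ASCII only so encoding plays no role here.
$dom = new DOMDocument();
$dom->loadHTML('<ul><li>one</li><li>two</li></ul><p>three</p>');

// getElementsByTagName() returns a DOMNodeList that can be iterated with foreach.
$items = $dom->getElementsByTagName('li');

$texts = [];
foreach ($items as $item) {
    // textContent is the concatenated text of the element and its descendants.
    $texts[] = $item->textContent;
}

echo $items->length, "\n";
echo implode(', ', $texts), "\n";
```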
DOMDocument::loadHTML() parses the HTML contained in the string source. Unlike loading XML, the HTML does not have to be well-formed to load. Older PHP versions also allowed calling this method statically to load and create a DOMDocument object; that usage was deprecated and throws an Error as of PHP 8.0.
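A small sketch of the "does not have to be well-formed" part, assuming you would rather collect the parser's complaints than have them printed as PHP warnings. The broken markup is a made-up example.

```php
<?php
// Hypothetical malformed markup: <p> and <b> are never closed.
$html = '<div><p>Unclosed paragraph <b>bold</div>';

// Buffer libxml parse errors instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$loaded = $dom->loadHTML($html);

// Fetch whatever the parser reported, then restore the previous setting.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

var_dump($loaded);
echo count($errors), " parser message(s)\n";

// Serialize the repaired tree to see how the parser closed the open tags.
echo $dom->saveHTML($dom->documentElement), "\n";
```

How many messages the parser reports, and their wording, depends on the libxml2 build, so the sketch only counts them.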
The DOM extension is bundled with PHP and enabled by default, so usually no separate installation is needed to use these functions.
You can add an XML encoding declaration in front of the markup (and take it out later). This works for me on setups that are not stock CentOS 5.x (Ubuntu, cPanel's PHP):
<?php
$dom = new DOMDocument('1.0', 'UTF-8');
$str = '<p>Hello®</p>';
var_dump(mb_detect_encoding($str));
$dom->loadHTML('<?xml encoding="utf-8">'.$str);
var_dump($dom->saveHTML());
This is what you get:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<?xml encoding="utf-8"><html><body><p>Hello®</p></body></html>
Except on days when you get this:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<?xml encoding="utf-8"><html><body><p>Hello®</p></body></html>
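The "take it out later" step is not shown above. One way it could look, as a sketch only: after loading, drop any processing-instruction node left at the top of the document and set the document encoding before saving. The removal loop and the `$dom->encoding` assignment are my additions, not part of the answer, and as the two outputs above suggest, results can differ between environments.

```php
<?php
$str = '<p>Hello®</p>';

$dom = new DOMDocument('1.0', 'UTF-8');
// Temporary hint so the parser treats the input as UTF-8.
$dom->loadHTML('<?xml encoding="utf-8" ?>' . $str);

// Snapshot the child list first, because removing nodes while iterating
// a live DOMNodeList can skip entries.
foreach (iterator_to_array($dom->childNodes) as $node) {
    if ($node->nodeType === XML_PI_NODE) {
        $dom->removeChild($node);
    }
}

// Ask saveHTML() to serialize as UTF-8 rather than falling back to entities.
$dom->encoding = 'UTF-8';

echo $dom->saveHTML();
```

If your parser does not turn the hint into a processing-instruction node, the loop simply finds nothing to remove.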
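I fixed this by decoding the UTF-8 before passing it to loadHTML. Note that utf8_decode() converts to ISO-8859-1, so any character outside Latin-1 is replaced with a question mark.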
$dom->loadHTML( utf8_decode( $html ) );
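A variation on the same idea, offered as a sketch rather than as what this answer used: utf8_decode() is deprecated as of PHP 8.2, so instead of converting to another charset you can turn every non-ASCII character into a numeric character reference with mb_encode_numericentity() (requires the mbstring extension). The input is then plain ASCII and the parser's charset guess no longer matters. The sample string and the `$convmap` name are illustrative.

```php
<?php
// Hypothetical input with one Latin-1 character and one outside Latin-1.
$html = '<p>Hello® ☆</p>';

// Conversion map: encode every code point from U+0080 to U+10FFFF,
// with no offset, keeping all bits of the code point.
$convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];
$ascii = mb_encode_numericentity($html, $convmap, 'UTF-8');

$dom = new DOMDocument();
$dom->loadHTML($ascii);

// The parser decodes the references back into UTF-8 text internally.
$text = $dom->getElementsByTagName('p')->item(0)->textContent;

echo $ascii, "\n";
echo $text, "\n";
```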
saveHTML() seems to convert special characters such as German umlauts to their HTML entities, even though I set $dom->substituteEntities = false; ... o.O
This is quite strange, though, as the documentation states:
The DOM extension uses UTF-8 encoding.
(http://www.php.net/manual/de/class.domdocument.php, search for utf8)
Oh dear, encoding in PHP poses problems again and again... never ending story.
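For anyone who wants to check the saveHTML() behaviour on their own setup, here is a small sketch that compares serializing the whole document with serializing a single node. It is commonly reported that passing a node to saveHTML() leaves non-ASCII characters unescaped, but I would treat that as something to verify on your libxml2 build, not as a guarantee. The sample markup is made up.

```php
<?php
// ASCII-only input: the umlaut and sharp s arrive as named entities,
// so the parser's charset guess does not affect loading.
$dom = new DOMDocument();
$dom->loadHTML('<p>Gr&uuml;&szlig;e</p>');

$p = $dom->getElementsByTagName('p')->item(0);

// Internally the text is UTF-8, whatever the serializer does later.
$text = $p->textContent;

$whole = $dom->saveHTML();      // whole-document serialization
$nodeOnly = $dom->saveHTML($p); // serialization of just the <p> node

echo bin2hex($text), "\n";
echo $whole;
echo $nodeOnly, "\n";
```

Comparing the last two outputs shows whether your build escapes the umlauts in one mode and not the other.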