
What is DOMDocument doing to my string?

$dom = new DOMDocument('1.0', 'UTF-8');

$str = '<p>Hello®</p>';

var_dump(mb_detect_encoding($str)); 

$dom->loadHTML($str);

var_dump($dom->saveHTML()); 


Outputs

string(5) "UTF-8"

string(158) "<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p>Hello&Acirc;&reg;</p></body></html>
"

Why did my Unicode ® get converted to &Acirc;&reg; and how do I stop this?

Am I going crazy today?

asked Feb 21 '11 05:02 by alex


2 Answers

You can prepend an XML encoding declaration (and take it out later). This works for me on everything except stock CentOS 5.x (Ubuntu, cPanel's PHP):

<?php
$dom = new DOMDocument('1.0', 'UTF-8');
$str = '<p>Hello®</p>';
var_dump(mb_detect_encoding($str)); 
$dom->loadHTML('<?xml encoding="utf-8">'.$str);
var_dump($dom->saveHTML()); 

This is what you get:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<?xml encoding="utf-8"><html><body><p>Hello&reg;</p></body></html>

Except on days when you get this:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<?xml encoding="utf-8"><html><body><p>Hello&Acirc;&reg;</p></body></html>
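The "take it out later" step is not shown above. A minimal sketch of one way to do it, assuming the hint is parsed into a processing-instruction node that is a direct child of the document (as the output above suggests); other libxml versions may represent it differently, in which case the loop simply removes nothing:

```php
<?php
$dom = new DOMDocument('1.0', 'UTF-8');
$str = '<p>Hello®</p>';

// Prepend the encoding hint, exactly as in the answer above.
$dom->loadHTML('<?xml encoding="utf-8">' . $str);

// Copy the live node list into an array first, so that removing a node
// does not disturb the iteration.
foreach (iterator_to_array($dom->childNodes) as $node) {
    if ($node->nodeType === XML_PI_NODE) {
        $dom->removeChild($node); // drop the <?xml ...> hint from the tree
    }
}

echo $dom->saveHTML();
```

Whether the saved markup then contains a raw ® or an entity still depends on the installed libxml, so this only cleans up the hint; it does not make the workaround itself more reliable.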
answered Sep 28 '22 11:09 by Jan


I fixed this by decoding the UTF-8 to ISO-8859-1 before passing it to loadHTML.

$dom->loadHTML( utf8_decode( $html ) );
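Note that utf8_decode() can only represent characters that exist in ISO-8859-1, and it is deprecated as of PHP 8.2. A possible alternative, sketched here under the assumption that the mbstring extension is available, is to turn every non-ASCII character into a numeric entity before parsing, so the input is pure ASCII and the parser's encoding guess no longer matters. The variable names `$ascii` and `$text` are illustrative only:

```php
<?php
$html = '<p>Hello®</p>';

// Convert every code point from U+0080 up to U+10FFFF into a numeric
// character reference; ASCII (including the tags) is left untouched.
$convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];
$ascii = mb_encode_numericentity($html, $convmap, 'UTF-8');

$dom = new DOMDocument('1.0', 'UTF-8');
$dom->loadHTML($ascii);

// The DOM decodes the entity again, so node values come back as UTF-8.
$text = $dom->getElementsByTagName('p')->item(0)->textContent;
```

Unlike the ISO-8859-1 route, this should also survive characters outside Latin-1, such as € or CJK text.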

saveHTML() seems to encode special characters such as German umlauts as HTML entities. (Although I set $dom->substituteEntities = false;... o.O)

This is quite strange, though, as the documentation states:

The DOM extension uses UTF-8 encoding.

(http://www.php.net/manual/de/class.domdocument.php, search for utf8)
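One reading that reconciles the documentation with the output in the question: the DOM stores and returns UTF-8, but the HTML parser has to guess the encoding of the input when the markup declares none, and the `&Acirc;&reg;` result is exactly what you would get if it guessed ISO-8859-1. That the parser makes this guess is an assumption here and may vary by libxml version; the byte arithmetic itself can be checked without DOMDocument at all:

```php
<?php
// U+00AE (®) written out as its two UTF-8 bytes.
$registered = "\xC2\xAE";
$bytes = bin2hex($registered);

// Pretend each byte is a separate ISO-8859-1 character and re-encode as UTF-8:
// 0xC2 becomes Â and 0xAE becomes ®.
$misread = mb_convert_encoding($registered, 'UTF-8', 'ISO-8859-1');

// Entity-encode the misread text, as saveHTML() does on output.
$entities = htmlentities($misread, ENT_QUOTES, 'UTF-8');

echo $bytes, "\n", $entities, "\n";
```

This also fits with why utf8_decode() helps: it hands the parser the single ISO-8859-1 byte 0xAE, which matches the assumed guess.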

Oh dear, encoding in PHP poses problems again and again... never ending story.

answered Sep 28 '22 11:09 by graup