
Why Does DOM Change Encoding?

$string = file_get_contents('http://example.com');

if ('UTF-8' === mb_detect_encoding($string)) {
    $dom = new DOMDocument();
    // hack to preserve UTF-8 characters
    $dom->loadHTML('<?xml encoding="UTF-8">' . $string);
    $dom->preserveWhiteSpace = false;
    $dom->encoding = 'UTF-8';
    $body = $dom->getElementsByTagName('body');
    echo htmlspecialchars($body->item(0)->nodeValue);
}

This changes all UTF-8 characters to Å, ¾, ¤ and other rubbish. Is there any other way to preserve UTF-8 characters?

Please don't post answers telling me to make sure I am outputting it as UTF-8; I have made sure I am.

Thanks in advance :)

Richard Knop asked Feb 10 '10 12:02


2 Answers

I had similar problems recently and eventually found this workaround: convert all non-ASCII characters to HTML entities before loading the HTML.

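// Note: the 'HTML-ENTITIES' pseudo-encoding is deprecated as of PHP 8.2.
$string = mb_convert_encoding($string, 'HTML-ENTITIES', 'UTF-8');
$dom = new DOMDocument();
$dom->loadHTML($string);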
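
A minimal sketch of the same idea wrapped in a self-contained function, assuming the `mbstring` and `dom` extensions are available and the input really is UTF-8. The function name `bodyTextFromUtf8Html` and the sample markup are hypothetical, chosen only for illustration. It uses `mb_encode_numericentity()` rather than the `'HTML-ENTITIES'` pseudo-encoding, so this is a variation on the workaround and not the exact call shown above.

```php
<?php
// Hypothetical helper for illustration; the name is not from the original answer.
function bodyTextFromUtf8Html(string $html): string
{
    // Rewrite every code point above ASCII as a numeric entity,
    // so the parser's guess about the input charset no longer matters.
    $convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    $asciiOnly = mb_encode_numericentity($html, $convmap, 'UTF-8');

    $dom = new DOMDocument();
    // Real-world markup often triggers parser warnings; collect them quietly.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($asciiOnly);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Return the text content of <body>, or an empty string if there is none.
    $body = $dom->getElementsByTagName('body')->item(0);
    return $body === null ? '' : (string) $body->textContent;
}

echo htmlspecialchars(bodyTextFromUtf8Html('<html><body><p>Žluťoučký kůň</p></body></html>'));
```

Because the string handed to `loadHTML()` is pure ASCII, the result should not depend on which encoding the parser assumes for undeclared input.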
andrewmabbott answered Oct 19 '22 19:10


If it is definitely the DOM that is mangling the encoding, this trick worked for me a while back in the opposite direction (accepting ISO-8859-1 data). DOMDocument should use UTF-8 by default in any case, but you can still try:

    $dom = new DOMDocument('1.0', 'utf-8');
Pekka answered Oct 19 '22 18:10