UTF-8 with PHP DOMDocument loadHTML?

Tags: php, utf-8

Consider this example, test.php:

<?php
$mystr = "<p>Hello, με काचं  ça øy jeść</p>";
var_dump($mystr);
$domdoc = new DOMDocument('1.0', 'utf-8'); //DOMDocument();
$domdoc->loadHTML($mystr); // already here corrupt UTF-8?
var_dump($domdoc);
?>

If I run this with PHP 5.5.9 (cli), I get the following in the terminal:

$ php test.php 
string(50) "<p>Hello, με काचं  ça øy jeść</p>"
object(DOMDocument)#1 (34) {
  ["doctype"]=>
  string(22) "(object value omitted)"
...
  ["actualEncoding"]=>
  NULL
  ["encoding"]=>
  NULL
  ["xmlEncoding"]=>
  NULL
...
  ["textContent"]=>
  string(70) "Hello, με à¤à¤¾à¤à¤  ça øy jeÅÄ"
}

Clearly, the original string is correct as UTF-8, but the textContent of the DOMDocument is incorrectly encoded.

So, how can I get the content as correct UTF-8 in the DOMDocument?

asked Dec 19 '22 14:12 by sdaau


1 Answer

The DOM extension is built on libxml2, whose HTML parser was written for HTML 4, where the default encoding is ISO-8859-1. Unless it encounters an appropriate meta tag or XML declaration stating otherwise, loadHTML() assumes the content is ISO-8859-1.
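
The garbling in the question's output follows from that assumption: each byte of a multi-byte UTF-8 sequence is read as one ISO-8859-1 character and then stored as UTF-8 internally. A minimal sketch that reproduces the effect with mbstring alone; the function name simulateLatin1Misread is made up for illustration, and it imitates the parser's behaviour rather than calling it:

```php
<?php
// Hypothetical helper: imitates a parser that wrongly assumes its input
// is ISO-8859-1 and stores the result as UTF-8.
function simulateLatin1Misread(string $utf8Bytes): string
{
    // Treat every input byte as one ISO-8859-1 character, then encode
    // those characters as UTF-8.
    return mb_convert_encoding($utf8Bytes, 'UTF-8', 'ISO-8859-1');
}

$original = "\u{E7}a";                    // "ça", 3 bytes in UTF-8: C3 A7 61
$garbled  = simulateLatin1Misread($original);

echo $garbled, "\n";                      // prints "Ã§a"
echo strlen($original), ' -> ', strlen($garbled), " bytes\n";
```

Every non-ASCII byte turns into two bytes, which is why the corrupted textContent is longer than the text that went in.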

Specifying the encoding when creating the DOMDocument, as you have, does not influence what the parser does: loading HTML (or XML) replaces both the XML version and the encoding that you passed to the constructor.


Workarounds:

Either use mb_convert_encoding() to translate everything above the ASCII range into its HTML entity equivalent before loading:

$domdoc->loadHTML(mb_convert_encoding($mystr, 'HTML-ENTITIES', 'UTF-8'));
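
On PHP 8.2 and later, passing 'HTML-ENTITIES' to mbstring raises a deprecation notice. A possible substitute, sketched here as an assumption rather than as part of the original answer, is mb_encode_numericentity(), which rewrites every non-ASCII character as a numeric character reference so that the parser only ever sees ASCII. The helper name loadUtf8Html is hypothetical:

```php
<?php
// Hypothetical helper (not from the original answer): loads a UTF-8 HTML
// string by first turning all non-ASCII characters into numeric entities.
function loadUtf8Html(string $html): DOMDocument
{
    // Conversion map: code points 0x80..0x10FFFF, no offset, and a mask
    // that keeps every bit of the code point.
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    $asciiOnly = mb_encode_numericentity($html, $map, 'UTF-8');

    $doc = new DOMDocument();
    $doc->loadHTML($asciiOnly);
    return $doc;
}

$doc = loadUtf8Html("<p>Hello, \u{3BC}\u{3B5} \u{E7}a je\u{15B}\u{107}</p>");
echo $doc->getElementsByTagName('p')->item(0)->textContent, "\n";
```

Because the input handed to loadHTML() is pure ASCII, the result does not depend on which encoding the parser guesses.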

Or prepend a meta tag or an XML declaration that specifies UTF-8:

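// Option 1: prepend a meta tag that declares the charset
$domdoc->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />' . $mystr);
// Option 2: prepend an XML declaration that names the encoding
$domdoc->loadHTML('<?xml encoding="UTF-8">' . $mystr);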
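
Either prefix stays in the parsed document as an extra node. A minimal sketch of the meta-tag variant that removes the injected element again after loading; the helper name loadWithMetaCharset is made up, and the sketch assumes the input carries no charset declaration of its own:

```php
<?php
// Hypothetical helper: declares UTF-8 through a temporary meta tag.
function loadWithMetaCharset(string $html): DOMDocument
{
    $prefix = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';

    $doc = new DOMDocument();
    $doc->loadHTML($prefix . $html);

    // The injected tag is the first meta element in the parsed tree;
    // drop it so that only the caller's markup remains.
    $meta = $doc->getElementsByTagName('meta')->item(0);
    if ($meta !== null && $meta->parentNode !== null) {
        $meta->parentNode->removeChild($meta);
    }
    return $doc;
}

$doc = loadWithMetaCharset("<p>Hello, \u{3BC}\u{3B5} \u{E7}a je\u{15B}\u{107}</p>");
echo $doc->getElementsByTagName('p')->item(0)->textContent, "\n";
```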

answered Jan 06 '23 05:01 by user3942918