
Preserve utf8 when loading HTML from file

Well, apparently PHP and its standard libraries have some problems with character encodings, and DOMDocument is no exception.

There are workarounds for UTF-8 characters when loading an HTML string with $dom->loadHTML().

However, I haven't found a way to do this when loading HTML from a file with $dom->loadHTMLFile(). While it reads and sets the encoding from <meta /> tags, the problem comes back when those are not defined, for instance when loading a fragment of HTML (a template part such as footer.html) rather than a complete HTML document.

So, how do I preserve UTF-8 characters when loading HTML from a file that has no <meta /> tags, and where adding them is not an option?

Update

footer.html (the file is encoded in UTF-8 without BOM):

<div id="footer">
    <p>My sūpēr ōzōm ūtf8 štrīņģ</p>
</div>

index.php:

$dom = new DOMDocument;
$dom->loadHTMLFile('footer.html');
echo $dom->saveHTML(); // results in the all-too-familiar garbled characters

Thanks in advance!

tomsseisums asked Dec 03 '22 00:12



1 Answer

Try a hack like this one:

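// $html holds the markup as a string, e.g. read with file_get_contents()
$doc = new DOMDocument();
// the prepended XML declaration makes libxml treat the input as UTF-8
$doc->loadHTML('<?xml encoding="UTF-8">' . $html);
// dirty fix: drop the processing-instruction node the hack leaves behind
foreach ($doc->childNodes as $item) {
    if ($item->nodeType == XML_PI_NODE) {
        $doc->removeChild($item); // remove hack
        break; // stop here: the node list changes once a child is removed
    }
}
$doc->encoding = 'UTF-8'; // insert proper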

Several others are listed in the user comments here: http://php.net/manual/en/domdocument.loadhtml.php. It is also important that your document head include a meta tag specifying the encoding FIRST, directly after the opening <head> tag.
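
Since the question is about files rather than strings, here is a minimal sketch of how the same hack might be applied to a file: read it yourself with file_get_contents() and pass the string to loadHTML() instead of calling loadHTMLFile(). The function name loadUtf8HtmlFile is hypothetical (it is not part of DOMDocument), and the sketch assumes the file really is UTF-8 encoded, as footer.html in the question is.

```php
<?php
// Hypothetical helper, not part of the DOM extension: loads an HTML
// fragment from a UTF-8 file without needing a <meta> charset tag.
function loadUtf8HtmlFile(string $path): DOMDocument
{
    $html = file_get_contents($path);
    if ($html === false) {
        throw new RuntimeException("Cannot read $path");
    }

    $doc = new DOMDocument();

    // Assumption: parser warnings about the fragment are not needed here,
    // so collect them internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    // The prepended XML declaration makes libxml decode the input as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Remove the processing-instruction node left behind by the hack.
    foreach ($doc->childNodes as $item) {
        if ($item->nodeType === XML_PI_NODE) {
            $doc->removeChild($item);
            break; // the node list changes after removal, so stop here
        }
    }

    // Record the encoding on the document for later serialization.
    $doc->encoding = 'UTF-8';

    return $doc;
}
```

With the footer.html from the question, echo loadUtf8HtmlFile('footer.html')->saveHTML(); should then print the text intact. Note that the parser wraps the fragment in <html> and <body> elements, so you may prefer to pass a specific node to saveHTML() to get only the fragment back.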

Sinthia V answered Dec 18 '22 09:12
