DOMDocument encoding problems / characters transformed

Question

I am using DOMDocument to modify HTML before it is output to the page. The input is only an HTML fragment, not a complete page. My initial problem was that all French characters were garbled, which I was able to correct after some trial and error. Now only one problem seems to remain: the ’ character (the typographic apostrophe) gets transformed into ?.

The code:

<?php
    $dom = new DOMDocument('1.0', 'utf-8');
    $dom->loadHTML(utf8_decode($row->text));

    // Some pretty basic modification here, not even related to text

    // Reinsert HTML, and make sure to remove the DOCTYPE, html and body that get added automatically.
    $row->text = utf8_encode(preg_replace('/^<!DOCTYPE.+?>/', '', str_replace(array('<html>', '</html>', '<body>', '</body>'), array('', '', '', ''), $dom->saveHTML())));
?>

I know the utf8_decode/utf8_encode round trip is messy, but this is the only way I could make it work so far. Here is a sample string:

Input: Sans doute parce qu’il vient d’atteindre une date déterminante dans son spectaculaire cheminement

Output: Sans doute parce qu?il vient d?atteindre une date déterminante dans son spectaculaire cheminement

If I find any more details, I'll add them. Thank you for your time and support!

Artefacto · Accepted Answer

Don't use utf8_decode. If your text is in UTF-8, pass it as such.
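
The ? in the question's output comes from the decode step itself: utf8_decode() converts UTF-8 to ISO-8859-1, and the typographic apostrophe ’ (U+2019) has no ISO-8859-1 equivalent, whereas é does. A minimal sketch of the effect, using mb_convert_encoding() as a stand-in for utf8_decode()/utf8_encode() (an assumption made here because those functions are deprecated as of PHP 8.2; the mbstring extension is required):

```php
<?php
// UTF-8 input built from words in the question's sample string.
$utf8 = 'qu’il vient déterminante';

// Stand-in for utf8_decode(): convert UTF-8 to ISO-8859-1.
// Characters with no Latin-1 equivalent are replaced by mbstring's
// default substitute character, "?".
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');

// Stand-in for utf8_encode(): converting back cannot restore the apostrophe.
$roundTrip = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

echo $roundTrip, PHP_EOL; // qu?il vient déterminante
```

The accented é survives the round trip because it exists in ISO-8859-1; the apostrophe is lost before DOMDocument ever sees the text.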

Unfortunately, DOMDocument defaults to Latin-1 (ISO-8859-1) when loading HTML. The behavior seems to be:

When fetching a remote document, it should deduce the encoding from the HTTP headers.
If no such header was sent, or the file is local, it looks for the corresponding meta http-equiv tag.
Otherwise, it defaults to Latin-1.

Example of it working:

<?php
$s = <<<HTML
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
<body>
Sans doute parce qu’il vient d’atteindre une date déterminante
dans son spectaculaire cheminement
</body>
</html>
HTML;

libxml_use_internal_errors(true);
$d = new DOMDocument();
$d->loadHTML($s);

echo $d->textContent;

And with XML (default is UTF-8):

<?php
// The trailing space keeps "déterminante" and "dans" from running together.
$s = '<x>Sans doute parce qu’il vient d’atteindre une date déterminante ' .
    'dans son spectaculaire cheminement</x>';
libxml_use_internal_errors(true);
$d = new DOMDocument();
$d->loadXML($s);

echo $d->textContent;

Luke · Answer

loadHTML() doesn't always recognize the encoding specified in the Content-Type http-equiv meta tag.

If the new DOMDocument('1.0', 'UTF-8') and loadHTML('<?xml version="1.0" encoding="UTF-8"?>' . $html) hacks don't work for you, as they didn't for me (PHP 5.3.13), try this:

Add another <head> section immediately after the opening <html> tag, containing the correct Content-Type http-equiv meta tag. Then call loadHTML(), and afterwards remove the extra <head> tag.

// Ensure entire page is encoded in UTF-8
$encoding = mb_detect_encoding($body);
$body = $encoding ? @iconv($encoding, 'UTF-8', $body) : $body;

// Insert a head and meta tag immediately after the opening <html> to force UTF-8 encoding.
// PREG_OFFSET_CAPTURE reports byte offsets, so use the byte-based strlen()/substr()
// rather than the character-based mb_* functions.
$insertPoint = false;
if (preg_match("/<html.*?>/is", $body, $matches, PREG_OFFSET_CAPTURE)) {
    $insertPoint = strlen($matches[0][0]) + $matches[0][1];
}
if ($insertPoint) {
    $body = substr($body, 0, $insertPoint)
        . "<head><meta http-equiv='Content-type' content='text/html; charset=UTF-8' /></head>"
        . substr($body, $insertPoint);
}
$dom = new DOMDocument();

// Suppress warnings for loading non-standard html pages
libxml_use_internal_errors(true);
$dom->loadHTML($body);
libxml_use_internal_errors(false);

// Now remove extra <head>
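
The removal step is only a comment in the snippet above. One way to do it, offered as a sketch rather than as the answer's actual code, is to delete the injected node through the DOM. The sample $body is made up, and the code assumes the injected <head> is the first <head> in document order:

```php
<?php
// Hypothetical sample input; a real page would come from elsewhere.
$body = '<html><body><p>déjà vu, qu’il</p></body></html>';

// Inject the temporary <head> right after <html>, as described above.
$body = str_replace(
    '<html>',
    "<html><head><meta http-equiv='Content-type' content='text/html; charset=UTF-8' /></head>",
    $body
);

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($body);
libxml_clear_errors();
libxml_use_internal_errors(false);

// Assumption: the injected <head> is the first <head> in document order.
$extraHead = $dom->getElementsByTagName('head')->item(0);
if ($extraHead !== null && $extraHead->parentNode !== null) {
    $extraHead->parentNode->removeChild($extraHead);
}

// The text was already decoded as UTF-8 at load time.
$text = $dom->getElementsByTagName('p')->item(0)->textContent;
echo $text, PHP_EOL;
```

Once the meta element is gone, saveHTML() on the whole document may fall back to escaping non-ASCII characters as entities, so it is worth checking the serialized output.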

See this article: http://devzone.zend.com/1538/php-dom-xml-extension-encoding-processing/

David Meister · Answer

This was enough for me; the other answers here were overkill. It assumes an HTML document with an existing HEAD tag. My HEAD tags had no attributes, and I had no issues leaving the extra META tag in the HTML for my use case.

$data = str_ireplace('<head>', '<head><meta http-equiv="Content-Type" content="text/html; charset=utf-8" />', $data);
$document = new DOMDocument();
$document->loadHTML($data);
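
str_ireplace() only matches a literal <head>. If the opening tag might carry attributes, a regular expression is a small step up. This is a sketch under that assumption, with a made-up sample document:

```php
<?php
// Hypothetical input whose <head> tag carries an attribute.
$data = '<html><head lang="fr"><title>Test</title></head><body><p>qu’il déjà</p></body></html>';
$meta = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />';

// \b keeps <header> from matching; the limit of 1 touches only the first <head ...>.
$data = preg_replace('/<head\b[^>]*>/i', '$0' . $meta, $data, 1);

libxml_use_internal_errors(true);
$document = new DOMDocument();
$document->loadHTML($data);
libxml_clear_errors();

$text = $document->getElementsByTagName('p')->item(0)->textContent;
echo $text, PHP_EOL;
```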

Kodie Grantham · Answer

As others have pointed out, DOMDocument::loadHTML() defaults to Latin-1 (ISO-8859-1) when given an HTML fragment. It also wraps your HTML in something like this:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>YOUR HTML</body></html>

As others have also pointed out, you can fix the encoding by inserting a HEAD element into your HTML with a META element that declares the correct encoding.

However, if you're working with an HTML fragment, you probably don't want the wrapping, and you don't want to keep the HEAD element you inserted either.

The following code inserts the HEAD element and, after processing, uses a regex to remove all the wrapping elements:

<?php
$html = '<article class="grid-item"><p>Hello World</p></article><article class="grid-item"><p>Goodbye World</p></article>';
$head = '<head><meta http-equiv="Content-Type" content="text/html; charset=utf-8" /></head>';

libxml_use_internal_errors(true);
$dom = new DOMDocument('1.0', 'utf-8');
$dom->loadHTML($head . $html);
$xpath = new DOMXPath($dom);

// Loop through all article.grid-item elements and add the "invisible" class to them
$nodes = $xpath->query("//article[contains(concat(' ', normalize-space(@class), ' '), ' grid-item ')]");
foreach($nodes as $node) {
  $class = $node->getAttribute('class');
  $class .= ' invisible';
  $node->setAttribute('class', $class);
}

// \b stops the pattern from also stripping tags such as <header>
$content = preg_replace('/<\/?(!doctype|html|head|meta|body)\b[^>]*>/im', '', $dom->saveHTML());
libxml_use_internal_errors(false);

echo $content;
?>
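
An alternative to stripping the wrapper with a regex is to serialize only the children of <body>, so the doctype, <html>, <head> and <meta> never reach the output. This is a different technique from the one above, sketched here with a made-up fragment:

```php
<?php
// Hypothetical UTF-8 fragment with two top-level nodes.
$html = '<article class="grid-item"><p>déjà vu, qu’il</p></article><p>Goodbye World</p>';
$head = '<head><meta http-equiv="Content-Type" content="text/html; charset=utf-8" /></head>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($head . $html);
libxml_clear_errors();
libxml_use_internal_errors(false);

// Serialize each top-level node of <body> on its own, skipping the wrapper.
$content = '';
$bodyNode = $dom->getElementsByTagName('body')->item(0);
if ($bodyNode !== null) {
    foreach ($bodyNode->childNodes as $child) {
        $content .= $dom->saveHTML($child);
    }
}

echo $content, PHP_EOL;
```

Because nothing is pattern-matched, tags inside the fragment whose names merely start with head, meta or body are left alone.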

Tags: php, utf-8, domdocument

Kyrotomia
