 

DOMDocument breaks encoding?

I run the following code:

$page = '<p>Ä</p>';
$DOM = new DOMDocument();
$DOM->loadHTML($page);
echo 'source: ' . $page;
echo 'dom: ' . $DOM->getElementsByTagName('p')->item(0)->textContent;

and it outputs the following:

source: Ä

dom: Ã

So I don't understand: why does the text's encoding break when it passes through DOMDocument?

Mike asked Oct 01 '12 16:10


2 Answers

Here's a workaround that adds the proper encoding via meta header:

$DOM->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />' . $page);

I'm not sure whether UTF-8 is the character set you're actually using, so adjust it where necessary.
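
For reference, here is a minimal end-to-end sketch of this workaround, assuming the input string really is UTF-8. The byte escapes make the example independent of how the source file is saved, and the variable names are only illustrative.

```php
<?php
// "Ä" written as explicit UTF-8 bytes (assumption: the real input is UTF-8).
$page = "<p>\xC3\x84</p>";

$dom = new DOMDocument();
// The prepended meta tag tells the HTML parser to decode the bytes as UTF-8.
$dom->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />' . $page);

$text = $dom->getElementsByTagName('p')->item(0)->textContent;

// Compare bytes rather than glyphs, so the terminal's encoding cannot mislead.
echo bin2hex($text), "\n";
var_dump($text === "\xC3\x84");
```

Keep in mind that the injected meta element stays in the parsed document, so it will likely show up again if you serialize the document with saveHTML().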

See also: domdocument character set issue

Ja͢ck answered Oct 25 '22 09:10


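DOMDocument appears to be treating the input as ISO-8859-1 rather than UTF-8, then converting it to UTF-8 internally. In this conversion, the two bytes of Ä (0xC3 0x84) are read as two separate characters, so Ä becomes Ã„. Here's the catch: that second character („) does not exist in ISO-8859-1, where byte 0x84 is an invisible control character, but does exist in Windows-1252. This is why you are seeing no second character in your output.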
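
The byte-level effect can be reproduced without DOMDocument. This sketch assumes the mbstring extension is available. It only simulates reading UTF-8 bytes as a single-byte encoding and is not a description of what the parser does internally.

```php
<?php
// "Ä" encoded as UTF-8 is the two bytes C3 84.
$utf8 = "\xC3\x84";

// Read those two bytes as ISO-8859-1 and re-encode the result as UTF-8.
$asLatin1 = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

// Same bytes read as Windows-1252, where 0x84 is the visible character „.
$asCp1252 = mb_convert_encoding($utf8, 'UTF-8', 'Windows-1252');

echo bin2hex($utf8), "\n";     // c384
echo bin2hex($asLatin1), "\n"; // c383c284   (Ã + control character U+0084)
echo bin2hex($asCp1252), "\n"; // c383e2809e (Ã + „)
```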

You can fix this by calling utf8_decode on the output of textContent, or by declaring UTF-8 as the character encoding of the HTML you pass to loadHTML.
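
Here is a sketch of the first option. Since utf8_decode is deprecated as of PHP 8.2, it uses mb_convert_encoding for the same UTF-8 to ISO-8859-1 conversion, assuming mbstring is installed. The helper name undoLatin1Misread is hypothetical, and the sample bytes stand in for what textContent returned in the question.

```php
<?php
// Hypothetical helper: reverses the ISO-8859-1 misreading by mapping each
// UTF-8 code point back to the single byte it was originally read from.
function undoLatin1Misread(string $text): string
{
    return mb_convert_encoding($text, 'ISO-8859-1', 'UTF-8');
}

// Stand-in for the broken textContent: "Ã" followed by U+0084, as UTF-8.
$broken = "\xC3\x83\xC2\x84";

$fixed = undoLatin1Misread($broken);
echo bin2hex($fixed), "\n"; // c384, the UTF-8 bytes of "Ä"
```

In the question's code this would be applied as undoLatin1Misread($DOM->getElementsByTagName('p')->item(0)->textContent). Text that entered the document through entities such as &euro; lies outside ISO-8859-1 and would not survive this conversion, so declaring the charset up front is usually the safer route.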

Niet the Dark Absol answered Oct 25 '22 11:10