nodeValue from DOMDocument returning weird characters in PHP

Question

I'm trying to parse HTML pages and find the paragraphs (<p>) using getElementsByTagName('p').

The problem is that $element->nodeValue returns weird characters. The document is first fetched into $html using curl and then loaded into a DOMDocument.

I'm sure it has to do with charsets.

Here's an example of a response: "aujourdÃ¢Â€Â™hui".

Thanks in advance.
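
For readers wondering how text like the example above comes about: one plausible path (an assumption, not something the asker confirmed) is that UTF-8 bytes are misread as a single-byte charset twice in a row. The sketch below reproduces that with mb_convert_encoding; it assumes the mbstring extension is installed and that the script file itself is saved as UTF-8.

```php
<?php
// UTF-8 source text containing a curly apostrophe (U+2019, bytes E2 80 99).
$original = "aujourd’hui";

// Step 1: the UTF-8 bytes are misread as ISO-8859-1 and re-encoded as UTF-8.
$once = mb_convert_encoding($original, 'UTF-8', 'ISO-8859-1');

// Step 2: the result is misread again, this time as Windows-1252.
$twice = mb_convert_encoding($once, 'UTF-8', 'Windows-1252');

echo bin2hex($once), PHP_EOL; // raw bytes after the first misreading
echo $twice, PHP_EOL;         // the doubly garbled text
```

If this is what happens, the fix belongs at the first misreading: DOMDocument has to be told that the input is UTF-8 before it parses the markup.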

stagl · Accepted Answer

I had the same issue and noticed that loadHTML() no longer takes 2 parameters, so I had to find a different solution. With the following function in my DOM library, I was able to get rid of the garbled characters in my HTML content.

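private static function load_html($html)
{
    $doc = new DOMDocument;

    // The leading XML declaration tells libxml to parse the markup as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);

    // Remove the helper node again. Stop after the first match, because
    // childNodes is a live list and should not be modified while iterating.
    foreach ($doc->childNodes as $node) {
        if ($node->nodeType == XML_PI_NODE) {
            $doc->removeChild($node);
            break;
        }
    }

    $doc->encoding = 'UTF-8';

    return $doc;
}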
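
A minimal usage sketch of the same idea, written as a standalone function so it can be run outside a class. The function name load_utf8_html, the sample markup, and the libxml_use_internal_errors() calls (added only to silence warnings on imperfect markup) are assumptions for illustration, not part of the answer above.

```php
<?php
// Hypothetical standalone variant of the helper above.
function load_utf8_html(string $html): DOMDocument
{
    $doc = new DOMDocument();

    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    // The XML declaration hints that the input is UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Drop the hint node so it does not show up in the saved output.
    foreach ($doc->childNodes as $node) {
        if ($node->nodeType === XML_PI_NODE) {
            $doc->removeChild($node);
            break;
        }
    }
    $doc->encoding = 'UTF-8';

    return $doc;
}

// Sample input; in the question this would be the curl response in $html.
$html = '<html><body><p>aujourd’hui</p><p>déjà vu</p></body></html>';
$doc = load_utf8_html($html);

$paragraphs = [];
foreach ($doc->getElementsByTagName('p') as $p) {
    $paragraphs[] = $p->nodeValue;
}
print_r($paragraphs);
```

With the hint in place, nodeValue should return the paragraphs as intact UTF-8 strings.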

Admin · Answer

None of the above worked for me. I eventually found the following:

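// Create a DOMDocument instance
$doc = new DOMDocument();

// The fix: convert non-ASCII characters in the UTF-8 input to HTML entities
// before parsing. Note: the 'HTML-ENTITIES' target is deprecated as of PHP 8.2.
$doc->loadHTML(mb_convert_encoding($content, 'HTML-ENTITIES', 'UTF-8'));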
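
Because the 'HTML-ENTITIES' pseudo-encoding is deprecated in PHP 8.2 and later, a similar effect can be sketched with mb_encode_numericentity(), which turns every non-ASCII character into a numeric entity. This is a substitute technique, not the answer's original code; the function name load_html_with_entities and the sample input are hypothetical, and the mbstring extension is assumed.

```php
<?php
// Hypothetical helper: entity-encode non-ASCII characters, then parse.
function load_html_with_entities(string $content): DOMDocument
{
    // Map code points U+0080..U+10FFFF to numeric entities; ASCII markup is untouched.
    $convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    $encoded = mb_encode_numericentity($content, $convmap, 'UTF-8');

    $doc = new DOMDocument();
    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($encoded);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $doc;
}

$doc = load_html_with_entities('<p>aujourd’hui</p>');
$text = $doc->getElementsByTagName('p')->item(0)->nodeValue;
echo $text, PHP_EOL;
```

The parser only ever sees ASCII here, so its guess about the input charset no longer matters.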


Tags: php, character-encoding, nodevalue, domdocument

Elie
