 

PHP DOMDocument nodeValue dumps literal UTF-8 characters instead of encoded

I am experiencing an issue similar to this question:

nodeValue from DomDocument returning weird characters in PHP

The root cause I found can be reproduced with mb_convert_encoding().

In my unit tests, this finally caught the issue:

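// No source encoding is given, so mbstring falls back to its internal encoding.
$test = mb_convert_encoding('é', 'UTF-8');
$this->assertTrue(mb_check_encoding($test, 'UTF-8'), 'data is UTF-8');
// $this->rw is a project-specific object, not part of PHP.
$this->assertTrue($this->rw->checkEncoding($test, 'UTF-8'), 'data is UTF-8');
$this->assertIdentical($test, html_entity_decode('&eacute;', ENT_QUOTES, 'UTF-8'), 'values match');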

The raw UTF-8 bytes appear to be coming through unchanged, and the default code page of the system PHP is running on is most likely not UTF-8.
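The effect described here can be reproduced in isolation. The following is a minimal sketch, assuming the UTF-8 bytes are being misread as ISO-8859-1 somewhere along the way (an assumption; the post does not confirm which encoding the system uses). It shows that re-encoding such bytes yields a longer but still valid UTF-8 string, which is why a validity check alone does not catch the problem.

```php
<?php
// "\xC3\xA9" is the UTF-8 byte sequence for é, written as escapes so the
// example does not depend on the encoding of this source file.
$utf8 = "\xC3\xA9";

// Assumption: the bytes are misread as ISO-8859-1 and converted "to UTF-8",
// so each of the two bytes is encoded again as its own character.
$double = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

echo bin2hex($utf8), "\n";   // c3a9
echo bin2hex($double), "\n"; // c383c2a9

// The double-encoded string is still well-formed UTF-8.
var_dump(mb_check_encoding($double, 'UTF-8')); // bool(true)

// Converting back to ISO-8859-1 undoes exactly one layer of encoding.
$restored = mb_convert_encoding($double, 'ISO-8859-1', 'UTF-8');
var_dump($restored === $utf8); // bool(true)
```

Comparing byte sequences with bin2hex(), rather than echoing the characters, avoids a second round of confusion from the terminal or browser encoding.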

All the way up to parsing (with an HTML5lib implementation that dumps to DOMDocument), the strings stay clean and UTF-8 friendly. Only at the point of pulling data out using

$span->nodeValue

do I see a failure in encoding stability.

My guess is that the entity handling DOMDocument applies when exporting to nodeValue runs an encoding conversion but disregards the encoding declared inline in the document.
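For comparison, here is a minimal sketch of what nodeValue returns when the document is built by hand, so that no HTML parser is involved. The DOM extension works with UTF-8 strings, and nodeValue hands back the text with entities already resolved; turning characters into entities is a separate step. The variable names are illustrative, and the sketch does not reproduce the HTML5lib setup from the question.

```php
<?php
// Build a tiny document by hand so the result does not depend on a parser.
$doc  = new DOMDocument('1.0', 'UTF-8');
$span = $doc->createElement('span');
$span->appendChild($doc->createTextNode("\xC3\xA9")); // UTF-8 bytes for é
$doc->appendChild($span);

// nodeValue returns the literal UTF-8 text, not an entity.
$raw = $span->nodeValue;
echo bin2hex($raw), "\n"; // c3a9

// Entity encoding has to be requested explicitly, with the charset named.
$encoded = htmlentities($raw, ENT_QUOTES, 'UTF-8');
echo $encoded, "\n"; // &eacute;
```

If the bytes coming out of nodeValue differ from what went in, the conversion most likely happened while the markup was being parsed rather than in nodeValue itself.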

Given that my issue is with HTML5, I figured it would be directly related to the newness of the implementation, but it appears to be a broader issue. I haven't been able to find any information on this issue specific to DOMDocument via searches, other than the question mentioned at the beginning.
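The situation in the linked question usually involves DOMDocument::loadHTML() rather than HTML5lib, so the following sketch is only an illustration of that related case. It assumes the markup is UTF-8 and compares parsing with and without a charset declaration in the markup. The result of the undeclared case depends on the libxml2 build, so only the declared case carries an expected value.

```php
<?php
libxml_use_internal_errors(true); // keep parser warnings out of the output

$body = '<body><span>' . "\xC3\xA9" . '</span></body>'; // UTF-8 bytes for é

// Without a charset declaration the parser has to guess the input encoding.
$plain = new DOMDocument();
$plain->loadHTML('<html>' . $body . '</html>');
$guessed = $plain->getElementsByTagName('span')->item(0)->nodeValue;

// With an explicit declaration in the markup, the bytes are read as UTF-8.
$meta = '<head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head>';
$hinted = new DOMDocument();
$hinted->loadHTML('<html>' . $meta . $body . '</html>');
$declared = $hinted->getElementsByTagName('span')->item(0)->nodeValue;

echo bin2hex($guessed), "\n";  // build-dependent; c383c2a9 would mean double-encoded
echo bin2hex($declared), "\n"; // c3a9
```

Either way nodeValue returns valid UTF-8; what changes is how the input bytes were interpreted during parsing.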

UPDATE

In the name of moving forward, I have switched from HTML5lib and DOMDocument to Simple HTML DOM, which exports cleanly escaped HTML that I can then parse back into the correct UTF-8 characters.

Also, one function I did not try was

utf8_decode

So that may be a solution for anyone else experiencing this issue. It solved a related issue I was experiencing with AJAX/PHP; I found the solution in this blog post from 2009: Overcoming AJaX UTF-8 Encoding Limitation (in PHP)

Dave Espionage asked Oct 11 '22 14:10



1 Answer

I just used utf8_decode on a nodeValue and it did more or less work. I had the same problem with special characters not displaying correctly.

However, some characters remain problematic, such as the single quote ' and a few others (œ, for example).

So using $element->nodeValue alone will not work, but utf8_decode($element->nodeValue) will, though only PARTLY.
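A possible explanation for the partial result, offered as a sketch rather than a diagnosis: utf8_decode() converts UTF-8 to ISO-8859-1, and that character set has no œ and no typographic quote ’ (U+2019), so both are replaced with ?. The example assumes the troublesome quote was the typographic one, since a plain ASCII ' is identical in both encodings. It uses mb_convert_encoding() as a stand-in because utf8_decode() is deprecated as of PHP 8.2. Windows-1252 is shown only for comparison, as it does contain both characters.

```php
<?php
// UTF-8 byte sequences, written as escapes so the source file encoding
// does not matter.
$e     = "\xC3\xA9";     // é (U+00E9)
$oe    = "\xC5\x93";     // œ (U+0153)
$quote = "\xE2\x80\x99"; // ’ (U+2019), assumed to be the problematic quote
$text  = $e . $oe . $quote;

// Same target encoding as utf8_decode(): ISO-8859-1.
// Characters that do not exist there are replaced with "?" (0x3f) by default.
$latin1 = mb_convert_encoding($text, 'ISO-8859-1', 'UTF-8');
echo bin2hex($latin1), "\n"; // e93f3f

// Windows-1252 has code points for both œ (0x9c) and ’ (0x92).
$cp1252 = mb_convert_encoding($text, 'Windows-1252', 'UTF-8');
echo bin2hex($cp1252), "\n"; // e99c92
```

Converting to a single-byte encoding only helps when the page is actually served in that encoding. If the page is UTF-8, keeping the nodeValue output as it is and fixing the encoding at the parsing stage avoids losing characters.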

Patrick answered Oct 15 '22 10:10
