I am experiencing an issue similar to this question:
nodeValue from DomDocument returning weird characters in PHP
The root cause I have found can be reproduced with mb_convert_encoding().
In my unit tests, this finally caught the issue:
$test = mb_convert_encoding('é', "UTF-8");
$this->assertTrue(mb_check_encoding($test,'UTF-8'),'data is UTF-8');
$this->assertTrue($this->rw->checkEncoding($test,'UTF-8'),'data is UTF-8');
$this->assertIdentical($test,html_entity_decode('&eacute;',ENT_QUOTES,'UTF-8'),'values match');
The raw UTF-8 bytes appear to come through intact, but the default code page of the system PHP is running on is most likely not UTF-8, so a conversion that does not name its source encoding reinterprets those bytes.
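A minimal sketch of how that mismatch can be reproduced, assuming the mbstring extension is loaded. The helper name convertFrom is hypothetical, and the source encoding is passed explicitly here rather than left to the internal default as in the test above:

```php
<?php
// "é" as raw UTF-8 bytes, spelled out so the example does not depend on
// how this source file itself is saved.
$utf8 = "\xC3\xA9";

// Hypothetical helper: convert to UTF-8 while naming the source encoding.
function convertFrom(string $bytes, string $from): string
{
    return mb_convert_encoding($bytes, 'UTF-8', $from);
}

// Wrong assumption: each byte is read as one ISO-8859-1 character.
$wrong = convertFrom($utf8, 'ISO-8859-1');
// Correct assumption: valid UTF-8 passes through unchanged.
$right = convertFrom($utf8, 'UTF-8');

echo bin2hex($wrong), "\n"; // c383c2a9 (the double-encoded "Ã©")
echo bin2hex($right), "\n"; // c3a9

// The double-encoded string is still valid UTF-8, so this check alone
// cannot detect the problem.
var_dump(mb_check_encoding($wrong, 'UTF-8')); // bool(true)
```

This is why a comparison against a known-good value, like the assertIdentical call, catches the failure while the two encoding checks pass.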
All the way up until parsing (with an HTML5lib implementation that dumps to DOMDocument) the strings stay clean, UTF-8 friendly. Only at the point of pulling data using
$span->nodeValue
do I see a failure in encoding stability.
My guess is that the entity handling DOMDocument performs when exporting to nodeValue runs the text through an encoding converter, but ignores the encoding declared in the document itself.
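One way to probe that guess, sketched under the assumption that the markup reaches DOMDocument through loadHTML() at some point (HTML5lib may build the tree differently). As far as I know, loadHTML() assumes ISO-8859-1 when the markup does not declare a charset, while nodeValue always returns UTF-8. The helper name firstSpanText is hypothetical:

```php
<?php
// Hypothetical helper: parse markup and return the text of the first <span>.
function firstSpanText(string $html): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // keep parser warnings out of the output
    $doc->loadHTML($html);
    libxml_clear_errors();

    $span = $doc->getElementsByTagName('span')->item(0);
    return $span === null ? '' : $span->nodeValue;
}

// The same UTF-8 content, once bare and once with a declared charset.
$bare = "<span>\xC3\xA9</span>";
$declared = '<!DOCTYPE html><html><head>'
    . '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
    . '</head><body>' . $bare . '</body></html>';

echo bin2hex(firstSpanText($declared)), "\n"; // c3a9
// Compare with the undeclared input; the result depends on the parser's
// assumed input encoding.
echo bin2hex(firstSpanText($bare)), "\n";
```

If the two lines differ, the corruption happens at load time rather than in nodeValue itself.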
Given that my issue is with HTML5, I figured it would be directly related to the newness of the implementation, but it appears to be a broader issue. I haven't been able to find any information on this issue specific to DOMDocument via searches, other than the question mentioned at the beginning.
UPDATE
In the name of moving forward, I have switched from HTML5lib and DOMDocument to Simple HTML DOM, which exports cleanly escaped HTML that I can then parse back into the correct UTF-8 characters.
Also, one function I did not try was
utf8_decode
So that may be a solution for anyone else experiencing this issue. It solved a related issue I was having with AJAX/PHP; I found the solution in this blog post from 2009: Overcoming AJaX UTF-8 Encoding Limitation (in PHP)
I just used utf8_decode on a nodeValue and it did partly work; I had the same problem with special characters not displaying correctly.
However, some characters remain problematic, such as the simple quote ' and a few others (œ, for example).
So $element->nodeValue on its own will not work, but utf8_decode($element->nodeValue) will, PARTLY.
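A likely reason for the partial result is that utf8_decode converts to ISO-8859-1, which has no slot for œ or for typographic quotes. The sketch below assumes the problematic quote is the curly one (U+2019) and uses mb_convert_encoding in place of utf8_decode, which is deprecated as of PHP 8.2. The helper name toLegacy is hypothetical, and the "?" result relies on mbstring's default substitute character:

```php
<?php
$eAcute = "\xC3\xA9";     // é, U+00E9: present in ISO-8859-1
$oe     = "\xC5\x93";     // œ, U+0153: absent from ISO-8859-1
$quote  = "\xE2\x80\x99"; // curly quote, U+2019: absent from ISO-8859-1

// Hypothetical helper: convert UTF-8 text to a single-byte target encoding.
function toLegacy(string $utf8, string $target): string
{
    return mb_convert_encoding($utf8, $target, 'UTF-8');
}

echo bin2hex(toLegacy($eAcute, 'ISO-8859-1')), "\n"; // e9
echo toLegacy($oe, 'ISO-8859-1'), "\n";              // ?
echo toLegacy($quote, 'ISO-8859-1'), "\n";           // ?

// Windows-1252 is a superset that does include both characters.
echo bin2hex(toLegacy($oe, 'Windows-1252')), "\n";    // 9c
echo bin2hex(toLegacy($quote, 'Windows-1252')), "\n"; // 92
```

If the page really has to be served in a single-byte encoding, Windows-1252 loses fewer characters than ISO-8859-1. Keeping everything in UTF-8 from end to end avoids the lossy step altogether.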