I'm using PHPUnit to validate XML output from my PHP code, but apparently I have problems with the character encoding MySQL returns. Here is the error I get from DOMDocument:
Input is not proper UTF-8, indicate encoding! Bytes: 0xE9 0x20 0x42 0x65
I initialize the DOMDocument so it uses the correct encoding:
$domDocument = new DOMDocument('1.0','UTF-8');
And when I check the output from saveXML() using mb_detect_encoding the result is UTF-8.
I also checked all the calls used to create the XML, using mb_detect_encoding on all createCDATASection parameters encountered and they are all either UTF-8 or ASCII (there are no plain text nodes, everything is in CDATA blocks).
I think the issue comes from the use of an 'é' character (which is 0xE9 in ISO 8859-1). The line which adds that character to my XML is:
$domDocument->createCDATASection($place->name);
and mb_detect_encoding($place->name) gives me UTF-8.
The data ($place->name) is pulled from a MySQL database. This database has the UTF-8 charset.
Here is some example code:
$query = sprintf('SELECT name FROM place where id = 1');
$result = mysql_query($query);
$result = mysql_fetch_assoc($result);

// -- Feeding UTF-8 data directly WORKS
$domDocument = new DOMDocument('1.0','UTF-8');
$rootNode = $domDocument->createElement('Response');
$rootNode->appendChild($domDocument->createCDATASection('Café Belga'));
$domDocument->appendChild($rootNode);
$matcher = array('tag' => 'Response');
self::assertTag($matcher, $domDocument->saveXML(), '', FALSE);

// -- Feeding UTF-8 data from the resultset FAILS
$domDocument = new DOMDocument('1.0','UTF-8');
$rootNode = $domDocument->createElement('Response');
$rootNode->appendChild($domDocument->createCDATASection($result['name']));
$domDocument->appendChild($rootNode);
$matcher = array('tag' => 'Response');
self::assertTag($matcher, $domDocument->saveXML(), '', FALSE);
In my PHPStorm debugger, the string fetched from the database looks like this:
Caf� Belga
So I think that is the root of the problem. In MySQLWorkbench the string is correct: Café Belga.
When using utf8_encode($result['name']), however, everything works fine!
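That utf8_encode() fixes it is a strong hint that the bytes coming back are ISO 8859-1, since that function converts from ISO 8859-1 to UTF-8 (it is deprecated as of PHP 8.2). Here is a minimal sketch of the same conversion with mbstring, assuming the mbstring extension is loaded. The hard-coded $fromDb string is a hypothetical stand-in for the value fetched from MySQL:

```php
<?php
// Hypothetical stand-in for $result['name']: "Café Belga" encoded as ISO 8859-1.
$fromDb = "Caf\xE9 Belga";

// Convert from ISO 8859-1 to UTF-8, which is what utf8_encode() does.
$asUtf8 = mb_convert_encoding($fromDb, 'UTF-8', 'ISO-8859-1');

echo bin2hex($fromDb), "\n"; // 436166e92042656c6761   (é is the single byte e9)
echo bin2hex($asUtf8), "\n"; // 436166c3a92042656c6761 (é is the two bytes c3 a9)
```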
One more check in the watches window:
mb_detect_encoding($result['name'])
-> "UTF-8"
mb_detect_encoding(utf8_encode($result['name']))
-> "UTF-8"
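mb_detect_encoding() only guesses among candidate encodings, so "UTF-8" here does not prove the bytes are valid UTF-8. A stricter check, sketched below assuming mbstring is available, is mb_check_encoding(), which validates a string against one named encoding. Both sample strings are hypothetical stand-ins for the database value:

```php
<?php
// "Café Belga" as ISO 8859-1 bytes (what the error message suggests PHP received).
$latin1Bytes = "Caf\xE9 Belga";
// "Café Belga" as UTF-8 bytes (what DOMDocument expects).
$utf8Bytes = "Caf\xC3\xA9 Belga";

// 0xE9 starts a three-byte UTF-8 sequence, but 0x20 is not a continuation byte.
var_dump(mb_check_encoding($latin1Bytes, 'UTF-8')); // bool(false)
var_dump(mb_check_encoding($utf8Bytes, 'UTF-8'));   // bool(true)
```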
On a side note, are there any sites where I can simply copy-paste those hex values and see what characters they are supposed to be in different character sets?
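Short of a website, the lookup can be done locally. The sketch below assumes the mbstring and ctype extensions are loaded. decodeHexBytes() is a hypothetical helper name, not a built-in. It interprets a list of hex byte values in a given charset and returns the text as UTF-8:

```php
<?php
/**
 * Decode a string of hex byte values (e.g. "0xE9 0x20") as text in $charset
 * and return it as UTF-8. Returns null if the input is not valid hex.
 */
function decodeHexBytes(string $hex, string $charset): ?string
{
    // Drop "0x" prefixes and whitespace so only hex digits remain.
    $digits = preg_replace('/\b0x|\s+/i', '', $hex);
    if ($digits === '' || strlen($digits) % 2 !== 0 || !ctype_xdigit($digits)) {
        return null;
    }
    return mb_convert_encoding(hex2bin($digits), 'UTF-8', $charset);
}

// The bytes from the DOMDocument error, read in two single-byte charsets.
foreach (['ISO-8859-1', 'Windows-1252'] as $charset) {
    echo $charset, ': ', decodeHexBytes('0xE9 0x20 0x42 0x65', $charset), "\n";
}
```

In ISO 8859-1 those four bytes read as "é Be", which matches the middle of "Café Belga".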
MySQL supports multiple Unicode character sets:
utf8mb4: A UTF-8 encoding of the Unicode character set using one to four bytes per character.
utf8mb3: A UTF-8 encoding of the Unicode character set using one to three bytes per character.
The MySQL server has a compiled-in default character set and collation. To change these defaults, use the --character-set-server and --collation-server options when you start the server.
To change the character set of an existing database or table, use the ALTER DATABASE and ALTER TABLE commands. The CONVERT TO technique assumes that the text was correctly stored in some other charset (e.g., latin1) and not mangled (such as UTF-8 bytes crammed into a latin1 column without conversion to latin1).
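The database may store UTF-8, but the connection charset decides which encoding PHP receives, and the 0xE9 byte in the error shows the client is getting ISO 8859-1 (latin1). You have to define the connection to your database as UTF-8: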
// Set up your connection
$connection = mysql_connect('localhost', 'user', 'pw');
mysql_select_db('yourdb', $connection);
mysql_query("SET NAMES 'utf8'", $connection);

// Now you get UTF-8 encoded stuff
$query = sprintf('SELECT name FROM place where id = 1');
$result = mysql_query($query, $connection);
$result = mysql_fetch_assoc($result);
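As of PHP 5.5.0 the mysql_* extension is deprecated (it was removed in PHP 7.0), so you should switch to mysqli and, on a mysqli connection, use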
mysqli_set_charset($connection,"utf8");
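For completeness, here is a rough sketch of the same query with mysqli, assuming the mysqli extension is installed. fetchPlaceName() is a hypothetical helper, the host, credentials, and database name are the placeholders used above, and utf8mb4 is my choice here rather than something the question requires (plain 'utf8' is MySQL's three-byte variant):

```php
<?php
/**
 * Fetch one place name over a connection whose charset is set to UTF-8.
 * Table and column names follow the question; everything else is illustrative.
 */
function fetchPlaceName(mysqli $db, int $id): ?string
{
    // Ask the server to send and receive text as UTF-8 on this connection.
    if (!$db->set_charset('utf8mb4')) {
        throw new RuntimeException('Could not set charset: ' . $db->error);
    }

    $stmt = $db->prepare('SELECT name FROM place WHERE id = ?');
    if ($stmt === false) {
        throw new RuntimeException('Prepare failed: ' . $db->error);
    }
    $stmt->bind_param('i', $id);
    $stmt->execute();
    $stmt->bind_result($name);
    $found = $stmt->fetch(); // true when a row was fetched, null when none is left
    $stmt->close();

    return $found ? $name : null;
}

// Usage (needs a reachable MySQL server, so it is left commented out):
// $db = new mysqli('localhost', 'user', 'pw', 'yourdb');
// echo fetchPlaceName($db, 1), "\n";
```

Calling set_charset() rather than issuing SET NAMES by hand also tells the client library which charset is in use, which matters for escaping functions such as mysqli_real_escape_string().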