
Manipulating utf8mb4 data from MySQL with PHP

This is probably something simple. I swear I've been looking online for the answer and haven't found it. Since my particular case is a little atypical I finally decided to ask here.

I have a few tables in MySQL that I'm using for a Chinese language program. It needs to be able to support every possible Chinese character, including rare ones that don't have great font support. A sample cell in the table might be this:

東菄鶇䍶𠍀倲𩜍𢘐涷蝀凍鯟𢔅崠埬𧓕䰤

In order to get that to work right in the database, I've had to set the encoding/collation to utf8mb4. So far so good. Unfortunately when I pull the same string into PHP, it gets printed as this:

東菄鶇䍶?倲??涷蝀凍鯟?崠埬?䰤

How can I get rid of the remaining question marks and have them show up as the Unicode glyphs they should be? The PHP page itself already declares UTF-8 encoding, including as a meta tag.

Why can't they communicate with each other? What am I doing wrong?

Yhilan asked Oct 23 '12 10:10


2 Answers

I'd guess that you are setting the table to utf8mb4, but your connection encoding is set to utf8. You have to set the connection to utf8mb4 as well, otherwise MySQL will convert the stored utf8mb4 data to utf8. MySQL's utf8 uses at most three bytes per character, so it cannot encode "high" Unicode characters, meaning those outside the Basic Multilingual Plane, which need four bytes in UTF-8. (Yes, that's a MySQL idiosyncrasy.)
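
To see why exactly those characters go missing, here is a small sketch that imitates the effect in plain PHP, without a database. The function name `simulate_3byte_utf8` is made up for illustration. It replaces every code point above U+FFFF (the ones that need four bytes in UTF-8) with a question mark, which is roughly what a connection limited to MySQL's 3-byte utf8 does.

```php
<?php
// Hypothetical helper: imitates what a 3-byte "utf8" connection does to
// characters outside the Basic Multilingual Plane (code points > U+FFFF).
function simulate_3byte_utf8(string $text): string
{
    // The /u flag makes PCRE treat pattern and subject as UTF-8.
    // preg_replace() returns null on invalid input, so fall back to $text.
    return preg_replace('/[\x{10000}-\x{10FFFF}]/u', '?', $text) ?? $text;
}

// The sample cell from the question.
$sample = '東菄鶇䍶𠍀倲𩜍𢘐涷蝀凍鯟𢔅崠埬𧓕䰤';

echo simulate_3byte_utf8($sample), "\n";

// strlen() counts bytes, so it shows the difference directly:
echo strlen('𠍀'), "\n"; // 4 bytes, needs utf8mb4
echo strlen('東'), "\n"; // 3 bytes, fits in MySQL's utf8
```

The first line printed matches the broken string in the question, which suggests the 4-byte characters are being lost on the connection rather than in storage or in the browser.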

On a raw MySQL connection, it will have to look like this:

SET NAMES 'utf8mb4';
SELECT * FROM `my_table`;

You'll have to adapt that to whatever your client offers for setting the connection charset, depending on how you connect to MySQL from PHP (mysql, mysqli or PDO).


To really clarify (yes, this uses the old mysql_ extension for simplicity; it was removed in PHP 7, so don't do that at home):

mysql_connect(...);
mysql_select_db(...);
mysql_set_charset('utf8mb4');     // adapt to your mysql connector of choice

$r = mysql_query('SELECT * FROM `my_table`');

var_dump(mysql_fetch_assoc($r));  // data will be UTF8 encoded
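
The same idea with mysqli and PDO, as a sketch rather than a drop-in solution. The host, database name, and credentials are placeholders, and `utf8mb4_dsn`, `connect_pdo`, and `connect_mysqli` are hypothetical helper names. `mysqli::set_charset()` and the `charset` parameter of the PDO MySQL DSN are what the two extensions provide for setting the connection encoding.

```php
<?php
// Hypothetical helper: builds a PDO DSN that requests utf8mb4 on the connection.
function utf8mb4_dsn(string $host, string $dbname): string
{
    return "mysql:host={$host};dbname={$dbname};charset=utf8mb4";
}

// PDO variant: the charset is part of the DSN.
function connect_pdo(string $host, string $dbname, string $user, string $pass): PDO
{
    return new PDO(utf8mb4_dsn($host, $dbname), $user, $pass, [
        PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
    ]);
}

// mysqli variant: set the charset right after connecting.
function connect_mysqli(string $host, string $dbname, string $user, string $pass): mysqli
{
    $db = new mysqli($host, $user, $pass, $dbname);
    $db->set_charset('utf8mb4');
    return $db;
}

// Usage (placeholder credentials, not executed here):
// $pdo  = connect_pdo('localhost', 'my_database', 'user', 'secret');
// $rows = $pdo->query('SELECT * FROM `my_table`')->fetchAll(PDO::FETCH_ASSOC);
```

Either way, the charset is set once when connecting, before any query runs, so every result comes back as full 4-byte-capable UTF-8.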

deceze answered Oct 13 '22 00:10


Just to add to @deceze's answer, I recommend a well-configured MySQL server (for me, the configuration lives in /etc/mysql/mysql.conf.d/mysqld.cnf). Here are the configuration options that make sure you're using utf8mb4. I also recommend going through every MySQL configuration option, daunting as that is, because a lot of the defaults are far from optimal.

[client]

default-character-set           = utf8mb4

[mysql]

default_character_set           = utf8mb4

[mysqld]

init-connect                    = "SET NAMES utf8mb4"
character-set-client-handshake  = FALSE
character-set-server            = "utf8mb4"
collation-server                = "utf8mb4_unicode_ci"
autocommit                      = 1
block_encryption_mode           = "aes-256-cbc"

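That last one, block_encryption_mode, has nothing to do with character sets; it's just one that should be the default. The init-connect option makes the server run SET NAMES utf8mb4 for each new connection, so you don't have to execute it yourself every time, which keeps code clean. (Note that MySQL does not execute init-connect for users with the SUPER privilege.) Now run: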

SHOW VARIABLES WHERE Variable_name LIKE 'character\_set\_%' OR Variable_name LIKE 'collation%';

It should return something like the following:

+--------------------------+--------------------+
| Variable_name            | Value              |
+--------------------------+--------------------+
| character_set_client     | utf8mb4            |
| character_set_connection | utf8mb4            |
| character_set_database   | utf8mb4            |
| character_set_filesystem | binary             |
| character_set_results    | utf8mb4            |
| character_set_server     | utf8mb4            |
| character_set_system     | utf8               |
| collation_connection     | utf8mb4_unicode_ci |
| collation_database       | utf8mb4_unicode_ci |
| collation_server         | utf8mb4_unicode_ci |
+--------------------------+--------------------+
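
If you'd rather check this from PHP than eyeball the table, something like the following could work. It's only a sketch: `non_utf8mb4_settings` is a made-up helper that takes the variables as a name => value array (for example, built from the rows of the SHOW VARIABLES query above) and returns the ones that aren't utf8mb4. It deliberately skips character_set_filesystem and character_set_system, which are expected to differ.

```php
<?php
// Hypothetical helper: returns the character set / collation variables
// that are NOT utf8mb4, ignoring the two that are expected to differ.
function non_utf8mb4_settings(array $variables): array
{
    $ignored = ['character_set_filesystem', 'character_set_system'];
    $problems = [];
    foreach ($variables as $name => $value) {
        if (in_array($name, $ignored, true)) {
            continue;
        }
        // Collations such as utf8mb4_unicode_ci also start with "utf8mb4".
        if (strpos($value, 'utf8mb4') !== 0) {
            $problems[$name] = $value;
        }
    }
    return $problems;
}

// Example input: a connection that is still using the 3-byte utf8.
$variables = [
    'character_set_client'     => 'utf8',
    'character_set_connection' => 'utf8',
    'character_set_database'   => 'utf8mb4',
    'character_set_filesystem' => 'binary',
    'character_set_results'    => 'utf8',
    'character_set_server'     => 'utf8mb4',
    'character_set_system'     => 'utf8',
    'collation_connection'     => 'utf8_general_ci',
];

// Lists the client, connection, results and collation_connection entries.
print_r(non_utf8mb4_settings($variables));
```

An empty result means every relevant variable is on utf8mb4; anything listed is a place where 4-byte characters can still be lost.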

It looks like you're doing this already, but it doesn't hurt to define the character set explicitly on table creation:

CREATE TABLE `mysql_table` (
  `mysql_column` BIGINT(20) UNSIGNED NOT NULL AUTO_INCREMENT,
  PRIMARY KEY (`mysql_column`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

Hope this helps someone.

Eugene answered Oct 13 '22 00:10