I have some MySQL tables with utf8mb4 fields, and others with utf8.
It is safe to use utf8mb4 in the PDO connection string for all tables? Or do I have to convert everything to utf8mb4, or start two different PDO connections?
EDIT: The question is not "can I store 4-byte characters into utf8 columns?" We already know we can't, that doesn't depend on the connection, so if a column is utf8 it means it will not receive 4 bytes characters, for example country or currency codes, email addresses, usernames... where the input is validated by the application.
As of MySQL 5.5. 3 (2010), the utf8mb4 charset provides full UTF-8 support, being completely backwards compatible, not requiring more space than utf8 for characters that are in the utf8 set, and using extra byte for characters outside of the utf8 set.
utf-8 can store only 1, 2 or 3 bytes characters, while utf8mb4 can store 4 bytes characters as well. utf-8 is a subset of characters given by utf8mb4 .
If you need a database, don't use MySQL or MariaDB. Use PostgreSQL. If you need to use MySQL or MariaDB, never use โutf8โ. Always use โutf8mb4โ when you want UTF-8.
This can be tested quite easily with the following script:
<?php
$pdo = new PDO('mysql:host=localhost;dbname=test', 'test', '');
$pdo->exec("
drop table if exists utf8_test;
create table utf8_test(
conn varchar(50) collate ascii_bin,
column_latin1 varchar(50) collate latin1_general_ci,
column_utf8 varchar(50) collate utf8_unicode_ci,
column_utf8mb4 varchar(50) collate utf8mb4_unicode_ci
);
");
$latin = 'abc รคลรฉ';
$utf8 = 'โโ';
$mb4 = '๐ ๐ฃ';
$pdo->exec("set names utf8");
$pdo->exec("
insert into utf8_test(conn, column_latin1, column_utf8, column_utf8mb4)
values ('utf8', '$latin', '$latin $utf8', '$latin $utf8 $mb4')
");
$pdo->exec("set names utf8mb4");
$pdo->exec("
insert into utf8_test(conn, column_latin1, column_utf8, column_utf8mb4)
values ('utf8mb4', '$latin', '$latin $utf8', '$latin $utf8 $mb4')
");
$result = $pdo->query('select * from utf8_test')->fetchAll(PDO::FETCH_ASSOC);
var_export($result);
And this is the result:
array (
0 =>
array (
'conn' => 'utf8',
'column_latin1' => 'abc รคลรฉ',
'column_utf8' => 'abc รคลรฉ โโ',
'column_utf8mb4' => 'abc รคลรฉ โโ ???? ????',
),
1 =>
array (
'conn' => 'utf8mb4',
'column_latin1' => 'abc รคลรฉ',
'column_utf8' => 'abc รคลรฉ โโ',
'column_utf8mb4' => 'abc รคลรฉ โโ ๐ ๐ฃ',
),
)
As you can see, we can not use utf8
as connection charset, when we work with utf8mb4
columns (see ????
). But we can use utf8mb4
for connection when working with utf8
columns. Also neither has problems writing to and reading from latin
or ascii
columns.
The reason is that you can encode any utf8
, latin
or ascii
character in utf8mb4
but not the other way around. So using utf8mb4
as character set for connection is safe in this case.
Short Answer: Yes, if you are only using 3-byte (or shorter) UTF-8 characters.
Or... No if you intend to work with 4-byte UTF-8 characters such as ๐ ๐๐.
Long Answer:
(And I will address why "no" could be the right answer.)
The connection establishes what encoding the client is using.
The CHARACTER SET
on a column (or, by default, from the table) establishes what encoding can be put into the column.
CHARACTER SET utf8
is a subset of utf8mb4
. That is, all characters acceptable to utf8
(via connection or column) are acceptable to utf8mb4
. Phrased another way, MySQL's utf8mb4
(same as the outside world's UTF-8
) have the full 4-byte utf-8 encoding that includes more Emoji, more Chinese, etc, than MySQL up-to-3-byte utf8
(aka "BMP")
(Technically, utf8mb4
only handles up to 4 bytes, but UTF-8
handles longer characters. However, I doubt if 5-byte characters will happen in my lifetime.)
So, here's what happens with any 3-byte (or shorter) UTF-8 character in the client, given that the Connection is utf8mb4 and the columns in the tables are only utf8: Every character goes into and comes out of the server without transformation and without errors. Note: The problem occurs on INSERT
, not on SELECT
; however you may not notice the problem until you do a SELECT
.
But, what if you have an Emoji in the client? Now you will get an error. (Or a truncated string) (Or question mark(s)) This is because the 4-byte Emoji (eg, ๐ฉ) cannot be squeezed into the 3-byte "utf8" (or "1-byte latin1" or ...).
If you are running 5.5 or 5.6, you may run into the 767 (or 191) problem. I provide several workarounds in here. None is perfect.
As for inverting (utf8 connection but utf8mb4 columns): The SELECT
can have trouble if you manage to get some 4-byte characters into the table.
"Official sources" -- Good luck. I have spent a decade trying to tease out the ins and outs of character handling, and then simplify it into actionable sentences. Much of that time was thinking I had all the answers, only to encounter yet another failing test case. The common cases are listed in Trouble with UTF-8 characters; what I see is not what I stored . However, that does not directly address your question!
From Comment
mysql> SHOW CREATE TABLE emoji\G
*************************** 1. row ***************************
Table: emoji
Create Table: CREATE TABLE `emoji` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`text` varchar(255) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=5 DEFAULT CHARSET=utf8mb4
1 row in set (0.00 sec)
mysql> insert into emoji (text) values ("abc");
Query OK, 1 row affected (0.01 sec)
mysql> show variables like 'char%';
+--------------------------+----------------------------+
| Variable_name | Value |
+--------------------------+----------------------------+
| character_set_client | utf8 |
| character_set_connection | utf8 |
| character_set_database | utf8mb4 |
| character_set_filesystem | binary |
| character_set_results | utf8 |
| character_set_server | utf8mb4 |
| character_set_system | utf8 |
| character_sets_dir | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
8 rows in set (0.00 sec)
The above says that the "connection" (think "client") is using utf8, not utf8mb4.
mysql> insert into emoji (text) values ("๐
๐๐"); -- 4-byte Emoji
Query OK, 1 row affected, 1 warning (0.00 sec)
mysql> show warnings;
+---------+------+----------------------------------------------------------------------------------+
| Level | Code | Message |
+---------+------+----------------------------------------------------------------------------------+
| Warning | 1366 | Incorrect string value: '\xF0\x9F\x98\x85\xF0\x9F...' for column 'text' at row 1 |
+---------+------+----------------------------------------------------------------------------------+
1 row in set (0.00 sec)
Now, change the 'connection' to utf8mb4
:
mysql> SET NAMES utf8mb4;
Query OK, 0 rows affected (0.00 sec)
mysql> insert into emoji (text) values ("๐
๐๐");
Query OK, 1 row affected (0.01 sec)
mysql> SELECT * FROM emoji;
+----+--------------+
| id | text |
+----+--------------+
| 1 | ? ? ? ? |
| 2 | abc |
| 3 | ???????????? | -- from when "utf8" was in use
| 4 | ๐
๐๐ | -- Success with utf8mb4 in use
+----+--------------+
4 rows in set (0.01 sec)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With