I'm trying to export some data from a MySQL database, but weird and wonderful things are happening to unicode in that table.
I will focus on one character, the left smartquote: “
When I use SELECT from the console, it is printed without issue:
    mysql> SELECT text FROM posts;
    +-------+
    | text  |
    +-------+
    | “foo” |
    +-------+
This means the data are being sent to my terminal as utf-8[0] (which is correct).
However, when I use SELECT * FROM posts INTO OUTFILE '/tmp/x.csv' …; the output file is not correctly encoded:
    $ cat /tmp/x.csv
    “fooâ€
Specifically, the “ is encoded with seven (7!) bytes: \xc3\xa2\xe2\x82\xac\xc5\x93.
What encoding is this? Or how could I tell MySQL to use a less unreasonable encoding?
Also, some miscellaneous facts:
- SELECT @@character_set_database returns latin1
- The text column is a VARCHAR(42):

    mysql> DESCRIBE posts;
    +-------+-------------+------+-----+---------+-------+
    | Field | Type        | Null | Key | Default | Extra |
    +-------+-------------+------+-----+---------+-------+
    | text  | varchar(42) | NO   | MUL |         |       |
    +-------+-------------+------+-----+---------+-------+

- “ encoded as utf-8 yields \xe2\x80\x9c
- \xe2\x80\x9c decoded as latin1 then re-encoded as utf-8 yields \xc3\xa2\xc2\x80\xc2\x9c (6 bytes).
- … (utf-8: \xe2\x80\xa6) is encoded to \xc3\xa2\xe2\x82\xac\xc2\xa6
[0]: as smart quotes aren't included in any 8-bit encoding, and my terminal correctly renders utf-8 characters.
The difference between utf8 and utf8mb4 is that the former can only store characters that take up to 3 bytes in UTF-8, while the latter can store characters that take up to 4 bytes. In Unicode terms, utf8 can only store characters in the Basic Multilingual Plane, while utf8mb4 can store any Unicode character.
MySQL supports multiple Unicode character sets:

- utf8mb4: A UTF-8 encoding of the Unicode character set using one to four bytes per character.
- utf8mb3: A UTF-8 encoding of the Unicode character set using one to three bytes per character. This character set is deprecated in MySQL 8.0, and you should use utf8mb4 instead.
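To make the 3-byte versus 4-byte distinction concrete, here is a minimal Python sketch. The helper names utf8_length and fits_in_utf8mb3 are made up for illustration; they only measure UTF-8 byte lengths and do not talk to MySQL.

```python
def utf8_length(ch: str) -> int:
    """Number of bytes a single character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


def fits_in_utf8mb3(text: str) -> bool:
    """True if every character needs at most 3 UTF-8 bytes (i.e. is in the BMP)."""
    return all(utf8_length(ch) <= 3 for ch in text)


if __name__ == "__main__":
    # The left smart quote is in the BMP and needs 3 bytes.
    print(utf8_length("\u201c"), fits_in_utf8mb3("\u201cfoo\u201d"))
    # An emoji outside the BMP needs 4 bytes, so it would need utf8mb4.
    print(utf8_length("\U0001F600"), fits_in_utf8mb3("\U0001F600"))
```

So the smart quotes in the question fit in either character set; the corruption seen in the outfile is a separate issue.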
To change the character set encoding to UTF-8 for the database itself, type the following command at the mysql> prompt, replacing dbname with the database name:

    ALTER DATABASE dbname CHARACTER SET utf8 COLLATE utf8_general_ci;

To exit the mysql program, type \q at the mysql> prompt.
What is the difference between utf8 and latin1? They are different encodings. Only the ASCII characters are encoded as the same bytes in both; accented letters such as é share a code point but are one byte in Latin1 and two bytes in UTF-8. UTF-8 is one encoding of Unicode with all its code points; Latin1 encodes at most 256 characters.
Newer versions of MySQL have an option to set the character set in the outfile clause:
    SELECT col1,col2,col3 FROM table1 INTO OUTFILE '/tmp/out.txt' CHARACTER SET utf8 FIELDS TERMINATED BY ','
Many programs/standards (including MySQL) assume that "latin1" means "cp1252", so the 0x80 byte is interpreted as a Euro symbol, which is where the \xe2\x82\xac bit (U+20AC) comes from in the middle.
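The byte sequences in the question can be reproduced outside MySQL. The Python sketch below is only an illustration: the function names are made up, and the set of five undefined bytes reflects how MySQL's latin1 extends cp1252 (it maps the bytes cp1252 leaves unassigned to the code points of the same number). It shows the double encoding and one way to undo it.

```python
# Bytes that cp1252 leaves unassigned; MySQL's latin1 maps them to U+0081 etc.
UNDEFINED = {0x81, 0x8D, 0x8F, 0x90, 0x9D}


def mysql_latin1_decode(raw: bytes) -> str:
    """Decode bytes as MySQL's latin1: cp1252 plus the five unassigned bytes."""
    return "".join(
        chr(b) if b in UNDEFINED else bytes([b]).decode("cp1252") for b in raw
    )


def double_encode(text: str) -> bytes:
    """UTF-8 bytes misread as latin1 (cp1252), then encoded as UTF-8 again."""
    return mysql_latin1_decode(text.encode("utf-8")).encode("utf-8")


def repair(raw: bytes) -> str:
    """Reverse double_encode: recover the original text from the mangled bytes."""
    out = bytearray()
    for ch in raw.decode("utf-8"):
        if ord(ch) in UNDEFINED:
            out.append(ord(ch))
        else:
            out += ch.encode("cp1252")
    return out.decode("utf-8")


if __name__ == "__main__":
    print(double_encode("\u201c"))  # left smart quote, 7 bytes
    print(double_encode("\u2026"))  # ellipsis
    # True ISO 8859-1 would instead give the 6-byte form from the question.
    print("\u201c".encode("utf-8").decode("latin-1").encode("utf-8"))
    print(repair(double_encode("\u201cfoo\u201d")))
```

Run against the question's data, this gives \xc3\xa2\xe2\x82\xac\xc5\x93 for “ and \xc3\xa2\xe2\x82\xac\xc2\xa6 for …, matching the outfile, which suggests the table holds UTF-8 bytes in a latin1 column and the export converts them a second time.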
When I try this, it works properly (but note how I put data in, and the variables set on the db server):
    mysql> set names utf8; -- http://dev.mysql.com/doc/refman/5.0/en/charset-connection.html
    mysql> create table sq (c varchar(10)) character set utf8;
    mysql> show create table sq\G
    *************************** 1. row ***************************
           Table: sq
    Create Table: CREATE TABLE `sq` (
      `c` varchar(10) default NULL
    ) ENGINE=MyISAM DEFAULT CHARSET=utf8
    1 row in set (0.19 sec)

    mysql> insert into sq values (unhex('E2809C'));
    Query OK, 1 row affected (0.00 sec)

    mysql> select hex(c), c from sq;
    +--------+------+
    | hex(c) | c    |
    +--------+------+
    | E2809C | “    |
    +--------+------+
    1 row in set (0.00 sec)

    mysql> select * from sq into outfile '/tmp/x.csv';
    Query OK, 1 row affected (0.02 sec)

    mysql> show variables like "%char%";
    +--------------------------+----------------------------+
    | Variable_name            | Value                      |
    +--------------------------+----------------------------+
    | character_set_client     | utf8                       |
    | character_set_connection | utf8                       |
    | character_set_database   | utf8                       |
    | character_set_filesystem | binary                     |
    | character_set_results    | utf8                       |
    | character_set_server     | latin1                     |
    | character_set_system     | utf8                       |
    | character_sets_dir       | /usr/share/mysql/charsets/ |
    +--------------------------+----------------------------+
    8 rows in set (0.00 sec)
And from the shell:
    /tmp$ hexdump -C x.csv
    00000000  e2 80 9c 0a                                       |....|
    00000004
Hopefully there's a useful tidbit in there…