
MySQL: character encoding used by SELECT INTO?


I'm trying to export some data from a MySQL database, but weird and wonderful things are happening to unicode in that table.

I will focus on one character, the left smartquote: “

When I use SELECT from the console, it is printed without issue:

    mysql> SELECT text FROM posts;
    +-------+
    | text  |
    +-------+
    | “foo” |
    +-------+

This means the data are being sent to my terminal as utf-8[0] (which is correct).

However, when I use SELECT * FROM posts INTO OUTFILE '/tmp/x.csv' …;, the output file is not correctly encoded:

    $ cat /tmp/x.csv
    “foo†

Specifically, the “ is encoded with seven (7!) bytes: \xc3\xa2\xe2\x82\xac\xc5\x93.

What encoding is this? Or how could I tell MySQL to use a less unreasonable encoding?

Also, some miscellaneous facts:

  • SELECT @@character_set_database returns latin1
  • The text column is a VARCHAR(42):
     mysql> DESCRIBE posts;
     +-------+-------------+------+-----+---------+-------+
     | Field | Type        | Null | Key | Default | Extra |
     +-------+-------------+------+-----+---------+-------+
     | text  | varchar(42) | NO   | MUL |         |       |
     +-------+-------------+------+-----+---------+-------+
  • “ encoded as utf-8 yields \xe2\x80\x9c
  • \xe2\x80\x9c decoded as latin1 then re-encoded as utf-8 yields \xc3\xa2\xc2\x80\xc2\x9c (6 bytes).
  • Another data point: … (utf-8: \xe2\x80\xa6) is encoded to \xc3\xa2\xe2\x82\xac\xc2\xa6

[0]: as smart quotes aren't included in any 8-bit encoding, and my terminal correctly renders utf-8 characters.

David Wolever asked Mar 19 '12 04:03



2 Answers

Newer versions of MySQL have an option to set the character set in the outfile clause:

    SELECT col1,col2,col3
    FROM table1
    INTO OUTFILE '/tmp/out.txt'
    CHARACTER SET utf8
    FIELDS TERMINATED BY ','
mvd answered Sep 20 '22 21:09


Many programs/standards (including MySQL) assume that "latin1" means "cp1252", so the 0x80 byte is interpreted as a Euro symbol, which is where that \xe2\x82\xac bit (U+20AC) comes from in the middle.
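
That reading can be checked outside MySQL. The following is a minimal Python sketch, not MySQL's own conversion code: the helper names `reencode_via` and `escaped` are made up for illustration, and Python's `cp1252` codec is assumed to stand in for MySQL's `latin1`. It takes the UTF-8 bytes of a character, misreads them in a single-byte encoding, and writes the result back out as UTF-8.

```python
def reencode_via(text: str, assumed_charset: str) -> bytes:
    """Encode text as UTF-8, misread those bytes as assumed_charset,
    then encode the misread characters as UTF-8 again."""
    stored = text.encode("utf-8")
    misread = stored.decode(assumed_charset)
    return misread.encode("utf-8")


def escaped(data: bytes) -> str:
    """Render bytes in the \\xNN notation used in the question."""
    return "".join(f"\\x{byte:02x}" for byte in data)


left_quote = "\u201c"  # left smart quote
ellipsis = "\u2026"    # horizontal ellipsis

print(escaped(reencode_via(left_quote, "cp1252")))   # \xc3\xa2\xe2\x82\xac\xc5\x93
print(escaped(reencode_via(ellipsis, "cp1252")))     # \xc3\xa2\xe2\x82\xac\xc2\xa6
print(escaped(reencode_via(left_quote, "latin-1")))  # \xc3\xa2\xc2\x80\xc2\x9c
```

The two cp1252 results match the seven bytes from the question and the ellipsis data point, while strict ISO 8859-1 gives the six-byte sequence the asker had already ruled out. One caveat: Python's `cp1252` codec rejects the bytes 0x81, 0x8d, 0x8f, 0x90 and 0x9d, so this sketch raises UnicodeDecodeError for a right smart quote (\xe2\x80\x9d).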

When I try this, it works properly (but note how I put data in, and the variables set on the db server):

    mysql> set names utf8; -- http://dev.mysql.com/doc/refman/5.0/en/charset-connection.html
    mysql> create table sq (c varchar(10)) character set utf8;
    mysql> show create table sq\G
    *************************** 1. row ***************************
           Table: sq
    Create Table: CREATE TABLE `sq` (
      `c` varchar(10) default NULL
    ) ENGINE=MyISAM DEFAULT CHARSET=utf8
    1 row in set (0.19 sec)

    mysql> insert into sq values (unhex('E2809C'));
    Query OK, 1 row affected (0.00 sec)

    mysql> select hex(c), c from sq;
    +--------+------+
    | hex(c) | c    |
    +--------+------+
    | E2809C | “  |
    +--------+------+
    1 row in set (0.00 sec)

    mysql> select * from sq into outfile '/tmp/x.csv';
    Query OK, 1 row affected (0.02 sec)

    mysql> show variables like "%char%";
    +--------------------------+----------------------------+
    | Variable_name            | Value                      |
    +--------------------------+----------------------------+
    | character_set_client     | utf8                       |
    | character_set_connection | utf8                       |
    | character_set_database   | utf8                       |
    | character_set_filesystem | binary                     |
    | character_set_results    | utf8                       |
    | character_set_server     | latin1                     |
    | character_set_system     | utf8                       |
    | character_sets_dir       | /usr/share/mysql/charsets/ |
    +--------------------------+----------------------------+
    8 rows in set (0.00 sec)

And from the shell:

    /tmp$ hexdump -C x.csv
    00000000  e2 80 9c 0a                                       |....|
    00000004
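
For a file that was already exported with the doubled bytes from the question, the round trip can in principle be reversed after the fact. This is a hedged Python sketch rather than anything MySQL provides: `undo_double_encoding` is a hypothetical helper, and the fallback to `latin-1` assumes that MySQL's `latin1` maps the five bytes cp1252 leaves undefined (0x81, 0x8d, 0x8f, 0x90, 0x9d) to the control characters U+0081, U+008D and so on.

```python
def undo_double_encoding(raw: bytes) -> str:
    """Reverse a UTF-8 -> "latin1" (cp1252) -> UTF-8 round trip."""
    mangled = raw.decode("utf-8")
    original = bytearray()
    for ch in mangled:
        try:
            # Map each misread character back to the single byte it came from.
            original += ch.encode("cp1252")
        except UnicodeEncodeError:
            # Assumption: bytes that cp1252 leaves undefined were passed
            # through as U+0081, U+008D, U+008F, U+0090 or U+009D.
            # This still raises for characters above U+00FF.
            original += ch.encode("latin-1")
    return original.decode("utf-8")


# The seven bytes from the question, "foo", then a doubled right smart quote.
exported = b"\xc3\xa2\xe2\x82\xac\xc5\x93foo\xc3\xa2\xe2\x82\xac\xc2\x9d"
print(undo_double_encoding(exported))  # “foo”
```

Input that was never double-encoded will typically raise UnicodeEncodeError or UnicodeDecodeError here, which is a reasonable signal to leave that file alone. Pure ASCII passes through unchanged.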

Hopefully there's a useful tidbit in there…

taavi answered Sep 21 '22 21:09