I'm stuck trying to fetch UTF-8 text in a MySQL database from R. I'm running R on OS X (tried both via the GUI and command line), where the default locale is en_US.UTF-8, and no matter what I try, the query result shows "?" for all non-ASCII characters.
I've tried setting options(encoding='UTF-8')
, DBMSencoding='UTF-8'
when connecting via ODBC, setting Encoding(res$str) <- 'UTF-8'
after fetching the results, as well as 'utf8' variants of each of those, all to no avail. Running the query from the command line mysql client shows the results correctly.
I'm totally stumped. Any ideas why it's not working, or other things I should try?
Here's a fairly minimal test case:
$ mysql -u root mysql> CREATE DATABASE test; mysql> USE test; mysql> CREATE TABLE test (str VARCHAR(10)) ENGINE=InnoDB DEFAULT CHARSET=utf8; Query OK, 0 rows affected (0.02 sec) mysql> INSERT INTO test (str) VALUES ('こんにちは'); Query OK, 1 row affected (0.00 sec) mysql> select * from test; +-----------------+ | str | +-----------------+ | こんにちは | +-----------------+ 1 row in set (0.00 sec)
Querying the table in R using both RODBC and RMySQL shows "?????" for the str column:
> con <- odbcDriverConnect('DRIVER=mysql;user=root', DBMSencoding='UTF-8') > sqlQuery(con, 'SELECT * FROM rtest.test') str 1 ????? > library(RMySQL) Loading required package: DBI > con <- dbConnect(MySQL(), user='root') > dbGetQuery(con, 'SELECT * FROM rtest.test') str 1 ?????
For completeness, here's my sessionInfo:
> sessionInfo() R version 2.15.1 (2012-06-22) Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] RMySQL_0.9-3 DBI_0.2-5 RODBC_1.3-6
To save space with UTF-8, use VARCHAR instead of CHAR. Otherwise, MySQL must reserve three bytes for each character in a CHAR CHARACTER SET utf8 column because that is the maximum possible length. For example, MySQL must reserve 30 bytes for a CHAR(10) CHARACTER SET utf8 column.
utf8 has been used by MySQL is an alias for the utf8mb3 character set, but this usage is being phased out; as of MySQL 8.0. 28, SHOW statements and columns of Information Schema tables display utf8mb3 instead. For more information, see Section 10.9. 2, “The utf8mb3 Character Set (3-Byte UTF-8 Unicode Encoding)”.
The difference between utf8 and utf8mb4 is that the former can only store 3 byte characters, while the latter can store 4 byte characters. In Unicode terms, utf8 can only store characters in the Basic Multilingual Plane, while utf8mb4 can store any Unicode character.
Thanks to @chooban I found out the connection session was using latin1 instead of utf8. Here are two solutions I found:
SET NAMES utf8
to change the connection character set.CharSet=utf8
in the DSN string. I was not able to run SET NAMES
via ODBC.This question pointed me in the right direction.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With