Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How can I fix encoding issues with MySql's JSON and Java?

Day old problem, when returning a query from the MySql database I get characters like ç instead of ç.

I'm using a really simple class to try to pin the problem down:

package com.dataTest;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class EncodingTest {

    public static void main(String[] args) throws Exception {
        Class.forName("com.mysql.jdbc.Driver");

        Connection connection = DriverManager
                .getConnection("jdbc:mysql://internalip:3306/databasename?" + 
                               "user=user" +
                               "&password=password" + 
                               "&characterEncoding=utf-8" + 
                               "&useUnicode=yes");

        PreparedStatement stmt =
                connection.prepareStatement("SELECT * FROM `databasename`.teste_json;");
        ResultSet rs = stmt.executeQuery();

        while (rs.next()) {
            System.out.println(rs.getString("info"));
        }

        rs.close();
        stmt.close();
        connection.close();
    }
}

The information was added into the database via other class, but I changed it manually using the Workbench, problem remains.

The database collation is utf8_general_ci and the table encoding to utf8.

I really don't know what to do.

EDIT:

I copied the JSON string to a varchar(1500) field and it prints perfectly. I forgot to mention, the field mentioned in the query (info) is a JSON field.

EDIT2: It's not mojibake?

Question was closed based on Rick James answer claiming it was Mojibake.

Acording to the question Trouble with utf8 characters; what I see is not what I stored the following observations:

  • The bytes to be stored need to be UTF-8-encoded. Fix this.

I think they are stored correctly. I made a simple JSON for the HEX, LENGTH and CHAR_LENGTH like this:

'[{\"é\": \"\"}]', '5B7B22C3A9223A2022227D5D', '12', '11'
  • The connection when INSERTing and SELECTing text needs to specify utf8 or utf8mb4. Fix this.

The connection string specifies the charset:

&useUnicode=yes&characterEncoding=UTF-8
  • The column needs to be declared CHARACTER SET utf8 (or utf8mb4). Fix this.

Well, the database has charset/collation = utf8/utf8_general_ci The table has collation utf8_unicode_ci The JSON column (which is the problematic one) has no collation.

How odd.

The manual states

MySQL handles strings used in JSON context using the utf8mb4 character set and utf8mb4_bin collation. Strings in other character sets are converted to utf8mb4 as necessary. (For strings in the ascii or utf8 character sets, no conversion is needed because ascii and utf8 are subsets of utf8mb4.)

From https://dev.mysql.com/doc/refman/5.7/en/json.html

Doesn't this mean that the actual collation is already correct?

From what I understand, the information inserted in the server is correct (see HEX information) so the problem is selecting? But the connection string seems to be correct in all forms.

EDIT: NOT A DUPLICATE.

The issues in this case ARE NOT THE SAME in the other question, see my own answer to this question and the bug report mentioned.

like image 536
Johnny Bigoode Avatar asked Apr 10 '17 18:04

Johnny Bigoode


2 Answers

TLDR: Use convert USING one of the following utf8mb4, utf16 or utf32.

As commented by user pvg, MySql seems to store their JSON formats in a utf8 character type (utf8mb4, utf8, ascii), at least that's that their documentation says. According to this bug report (https://bugs.mysql.com/bug.php?id=81677), JSON are stored in utf8mb4, but the encoding is not supported by the current JDBC driver (as in April/2017, that's the artifact mysql-connector-java, version 6.0.6).

Interestingly enough, MySql has a CONVERT() function that can be used like this CONVERT(info USING utf8)

Changing my query to SELECT field1, field2, CONVERT(info USING utf8) as info FROM databasename.teste_json; completely fixed my issue.

Here are some results with the different encoding types in MySql.

I changed the field name for something more complex just to check what would happen. The string I used was 😄ãõêçé日本語のキーボード

In the Java program I removed all encoding references in the connection string, these were the results using CONVERT using several encodings found in MySql documentation (https://dev.mysql.com/doc/refman/5.5/en/charset-charsets.html):

big5     {"name": "??????日本語のキ?ボ?ド"}
dec8     {"name": "?ãõêçé?????????"}
cp850    {"name": "?ãõêçé?????????"}
hp8      {"name": "?âêÁµÅ?????????"}
koi8r    {"name": "???????????????"}
latin1   {"name": "?ãõêçé?????????"}
latin2   {"name": "????çé?????????"}
swe7     {"name": "?????`?????????"}
ascii    {"name": "???????????????"}
ujis     {"name": "?ãõêçé日本語のキーボード"}
sjis     {"name": "??????日本語のキーボード"}
hebrew   {"name": "???????????????"}
tis620   {"name": "???????????????"}
euckr    {"name": "??????日本語のキ?ボ?ド"}
koi8u    {"name": "???????????????"}
gb2312   {"name": "???ê?é日本?のキ?ボ?ド"}
greek    {"name": "???????????????"}
cp1250   {"name": "????çé?????????"}
gbk      {"name": "???ê?é日本語のキーボード"}
latin5   {"name": "?ãõêçé?????????"}
armscii8 {"name": "???????????????"}
utf8     {"name": "?ãõêçé日本語のキーボード"}
ucs2     {"name": "?ãõêçé日本語のキーボード"}
cp866    {"name": "???????????????"}
keybcs2  {"name": "?????é?????????"}
macce    {"name": "??õ??é?????????"}
macroman {"name": "?ãõêçé?????????"}
cp852    {"name": "????çé?????????"}
latin7   {"name": "??õ??é?????????"}
utf8mb4  {"name": "😄ãõêçé日本語のキーボード"}
cp1251   {"name": "???????????????"}
utf16    {"name": "😄ãõêçé日本語のキーボード"}
cp1256   {"name": "???êçé?????????"}
cp1257   {"name": "??õ??é?????????"}
utf32    {"name": "😄ãõêçé日本語のキーボード"}
binary   {"name": "ðãõêçéæ¥æ¬èªã®ã­ã¼ãã¼ã"}
geostd8  {"name": "???????????????"}
cp932    {"name": "??????日本語のキーボード"}
eucjpms  {"name": "?ãõêçé日本語のキーボード"}

Please note that the only encodings that managed to show correctly all characters, including the smiley face were utf8mb4, utf16, utf32.

EDIT: From all the discussion in the other post, it seems to be some kind of error in the JDBC driver, at least everything points to it, the currently possible workaround is to force CONVERT using utf8mb4. I strongly recommend reading through Rick James material on the subject if you wish to understand more about encoding problems.

like image 54
Johnny Bigoode Avatar answered Oct 17 '22 01:10

Johnny Bigoode


That's "Mojibake". The problem and solution is discussed in

Trouble with UTF-8 characters; what I see is not what I stored

Java: Note that it is spelled UTF-8, not utf8, nor utf8mb4:

Add useUnicode=true&characterEncoding=UTF-8 to the JDBC URL

<%@ page language="java" pageEncoding="UTF-8"%>
<%@ page contentType="text/html;charset=UTF-8" %>

compileJava.options.encoding = 'UTF-8'
<form method="post" action="/your/url/" accept-charset="UTF-8">
like image 21
Rick James Avatar answered Oct 16 '22 23:10

Rick James