Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

MySQL - Invalid utf8mb4 character string on update

My question is almost exactly the same as the one found here

MySQL - 1300 - Invalid utf8 character string on update

No solution was presented and the help from the person (creating a temporary table) didn't seem to help. Here is the select statement I am using:

SELECT
    CONVERT(line_1 USING utf8mb4),
    CONVERT(line_1 USING latin1),
    HEX(line_1)
FROM address
WHERE ((CAST(CONVERT(line_1 USING latin1) AS CHAR)) <> (CAST(line_1 AS CHAR)))
    AND CONVERT(line_1 USING utf8mb4) IS NULL;
+-------------------------------+------------------------------+----------------------------------------------------+
| CONVERT(line_1 USING utf8mb4) | CONVERT(line_1 USING latin1) | hex(line_1)                                        |
+-------------------------------+------------------------------+----------------------------------------------------+
| NULL                          | Högbergsgatan 97             | 48F6676265726773676174616E203937                   |
| NULL                          | Zücherstrasse 161            | 5AFC636865727374726173736520313631                 |
| NULL                          | 2275, Rue de l'Université    | 323237352C20527565206465206C27556E69766572736974E9 |
| NULL                          | Högbergsgatan 97             | 48F6676265726773676174616E203937                   |
+-------------------------------+------------------------------+----------------------------------------------------+

When I try to run the following update command I get:

UPDATE address
SET line_1 = CONVERT(CAST(CONVERT(line_1 USING latin1) AS CHAR) USING utf8mb4)
WHERE (CAST(CONVERT(line_1 USING latin1) AS CHAR) <> CAST(line_1 AS CHAR))
    AND CONVERT(line_1 USING utf8mb4) IS NULL;
ERROR 1300 (HY000): Invalid utf8mb4 character string: 'F66762'

I tried setting the line in the following ways, all producing the same error:

SET line_1 = CAST(CONVERT(line_1 USING latin1) AS CHAR)
SET line_1 = CONVERT(line_1 USING latin1)

I also looked at http://jonisalonen.com/2012/fixing-doubly-utf-8-encoded-text-in-mysql/ to see if maybe it was a double encoding issue, but none of these worked and I kept getting the same character string error.

Additionally, I looked at https://mathiasbynens.be/notes/mysql-utf8mb4 in order to help with the conversion steps, but utf8mb4 and utf8 are causing the exact same issues. (At first I thought it was a utf8 thing so I switched to utf8mb4 and when I still kept getting the same issues, I knew there was a deeper problem)

As you can see, something weird is going on. Looking at my show create address table, I can verify that the charset is set properly:

SHOW CREATE TABLE address;
| address | CREATE TABLE `address` (
  `addressid` bigint(20) NOT NULL AUTO_INCREMENT,
  `addressuuid` char(32) COLLATE utf8mb4_unicode_ci NOT NULL,
  `line_1` blob,
  PRIMARY KEY (`addressid`)
) ENGINE=InnoDB AUTO_INCREMENT=48970 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci COMMENT='All potential addresses' |

Additionally, you can see my character variables are correct in my instance:

mysql> show variables like 'char%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8mb4                    |
| character_set_connection | utf8mb4                    |
| character_set_database   | utf8                       |
| character_set_filesystem | binary                     |
| character_set_results    | utf8mb4                    |
| character_set_server     | utf8mb4                    |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
8 rows in set (0.00 sec)

How did I get here

So it might be beneficial to give some background information on the issue just in case it is something in the background that caused the issue.

I had a database that originally was set to latin1 encoding everything. I then ran the following code:

SET NAMES 'latin1';

/* We must change things to blob and then back again */
ALTER TABLE `address` CHANGE line_1 line_1 BLOB;
ALTER TABLE `address` CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE `address` CHANGE line_1 line_1 VARCHAR(64);

The reason for switching to blob and then back to varchar is the normal recommended procedure. (www.percona.com/blog/2013/10/16/utf8-data-on-latin1-tables-converting-to-utf8-without-downtime-or-double-encoding/)

Let me know if that helps, and if there is any more information that can be provided. I'm using MySQL 5.6 so theoretically it should handle things better, but who knows. As there are only 4 lines I could just manually update each one, but theoretically there is a greater underlying issue and since I do actually have quite a few more columns to go through, it would be nice to make sure I have a functional way to handle these cases in case I get something with quite a few lines.

like image 872
Aram Papazian Avatar asked Oct 20 '22 17:10

Aram Papazian


1 Answers

Since line_1 is a blob, not a text field, MySQL has no control over the "characters" in it, and does not care if it is non-text information (such as a JPG). In the examples you gave, you have latin1 text in the field (eg, hex F6 for ö). Hence, CONVERT(line_1 USING latin1) worked 'fine'.

I don't understand your goal. Are you trying to read the BLOB as TEXT? If so, and if all the non-ascii characters are encoded latin1, then that CONVERT is the answer.

If your goal is something else, then let's approach it from there.

It was not "doubly encoded", so none of them would work.

ALTER TABLE address CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

Do a SHOW CREATE TABLE address and check the CHARACTER SET of line_1.

like image 193
Rick James Avatar answered Oct 22 '22 23:10

Rick James