How to correct double-encoded UTF-8 strings sitting in MySQL utf8_general_ci fields?

Tags: php, mysql, utf-8

I have to redesign a class in which (amongst other things) UTF-8 strings are wrongly encoded twice:

$string = iconv('ISO-8859-1', 'UTF-8', $string);
:
$string = utf8_encode($string);

These faulty strings have been saved into multiple table fields all over a MySQL database. All affected fields use the collation utf8_general_ci.
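For illustration, the effect of the two calls on a single umlaut can be reproduced in a few lines. This is only a sketch: it assumes the mbstring extension is loaded, it uses mb_convert_encoding() in place of utf8_encode() (deprecated as of PHP 8.2), and double_encode() is a made-up helper name, not part of the class in question.

```php
<?php
// Hypothetical helper mimicking the faulty class: Latin-1 -> UTF-8, applied twice.
function double_encode(string $latin1): string
{
    // First pass is correct: one Latin-1 byte becomes a two-byte UTF-8 sequence.
    $once = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
    // Second pass is the bug: each UTF-8 byte is treated as a Latin-1 character.
    return mb_convert_encoding($once, 'UTF-8', 'ISO-8859-1');
}

$broken = double_encode("\xE4");   // "\xE4" is "ä" in ISO-8859-1
echo bin2hex($broken), "\n";       // prints c383c2a4
```

A correctly encoded ä is the two bytes c3 a4; the double-encoded form is four bytes and shows up as "Ã¤" in a UTF-8 client.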

Usually I'd set up a little PHP patch script that loops through the affected tables, SELECTs the records, corrects the faulty ones by applying utf8_decode() to the double-encoded fields, and UPDATEs them.
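If the patch-script route is taken, the per-value check might look like the sketch below. It is an assumption-laden example rather than the original class's code: fix_double_encoded() is a hypothetical name, mbstring is assumed, and the guard accepts a decoded value only if it is valid UTF-8 and re-encodes to exactly the stored bytes, so values that are already correct are left alone.

```php
<?php
// Hypothetical helper: undo one layer of Latin-1 -> UTF-8 encoding, but only
// when the result is provably a clean reversal of the stored value.
function fix_double_encoded(string $stored): string
{
    // Candidate: interpret the stored UTF-8 characters as Latin-1 bytes.
    $candidate = mb_convert_encoding($stored, 'ISO-8859-1', 'UTF-8');

    // Guard 1: the candidate must itself be valid UTF-8.
    if (!mb_check_encoding($candidate, 'UTF-8')) {
        return $stored;
    }
    // Guard 2: re-encoding the candidate must give back the stored bytes,
    // which rejects values containing characters outside Latin-1.
    if (mb_convert_encoding($candidate, 'UTF-8', 'ISO-8859-1') !== $stored) {
        return $stored;
    }
    return $candidate;
}

echo bin2hex(fix_double_encoded("\xC3\x83\xC2\xA4")), "\n";  // c3a4 (repaired)
echo bin2hex(fix_double_encoded("\xC3\xA4")), "\n";          // c3a4 (untouched)
```

A correctly stored value that really contains the text "Ã¤" would also be rewritten, so a dry run on a copy of the data seems advisable.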

As I have many huge tables this time, and the error only affects German umlauts (äöüßÄÖÜ), I'm wondering whether there's a smarter or faster solution than that.

Are pure MySQL solutions like the following safe and recommendable?

 UPDATE `table` SET `col` = REPLACE(`col`, 'Ã¤', 'ä');

Any other solutions/best practices?
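The REPLACE idea needs one search/replace pair per affected character. As a rough sketch, the pairs can be derived rather than typed by hand; umlaut_repair_map() is a hypothetical helper, mbstring is assumed, and the output is hex so that invisible bytes stay visible.

```php
<?php
// Hypothetical helper: maps each double-encoded byte sequence (hex) to the
// correct UTF-8 sequence (hex) for the seven affected characters.
function umlaut_repair_map(): array
{
    // Latin-1 bytes for ä ö ü ß Ä Ö Ü, written as escapes so the result
    // does not depend on the encoding of this source file.
    $latin1 = ["\xE4", "\xF6", "\xFC", "\xDF", "\xC4", "\xD6", "\xDC"];
    $map = [];
    foreach ($latin1 as $byte) {
        $good = mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-1');  // correct form
        $bad  = mb_convert_encoding($good, 'UTF-8', 'ISO-8859-1');  // double-encoded form
        $map[bin2hex($bad)] = bin2hex($good);
    }
    return $map;
}

foreach (umlaut_repair_map() as $bad => $good) {
    echo $bad, ' -> ', $good, "\n";   // first line: c383c2a4 -> c3a4
}
```

For ß, Ä, Ö and Ü the second character of the broken pair is a C1 control character (U+009F, U+0084, U+0096, U+009C), which most editors do not display, so REPLACE literals typed or pasted by hand are easy to get wrong.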

asked May 10 '11 at 14:05 by Jürgen Thelen


2 Answers

Alter the table to change the column character set to Latin-1. You will now have singly-encoded UTF-8 strings, but sitting in a field whose character set is supposed to be Latin-1.

Then change the column character set back to UTF-8 by way of the binary character set, so that MySQL does not convert the characters at any point.

ALTER TABLE MyTable MODIFY MyColumn ... CHARACTER SET latin1;
ALTER TABLE MyTable MODIFY MyColumn ... CHARACTER SET binary;
ALTER TABLE MyTable MODIFY MyColumn ... CHARACTER SET utf8;

(I believe that is the correct syntax; put the appropriate column type where the ... is.)
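What the three steps do to the stored bytes can be traced outside the database. The following is a sketch only, not MySQL's implementation: trace_repair() is a hypothetical name, mbstring is assumed, and strict ISO-8859-1 stands in for MySQL's latin1.

```php
<?php
// Hypothetical trace of what the three column conversions do to the stored bytes.
function trace_repair(string $stored): array
{
    // CHARACTER SET latin1: a real character conversion, so the bytes change.
    $latin1 = mb_convert_encoding($stored, 'ISO-8859-1', 'UTF-8');
    // CHARACTER SET binary: the bytes are kept, only the charset label is dropped.
    $binary = $latin1;
    // CHARACTER SET utf8: the same bytes are now simply read as UTF-8.
    $utf8 = $binary;
    // Skipping the binary step would convert latin1 -> utf8 and redo the damage.
    $skipped = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

    return [
        'stored'  => bin2hex($stored),
        'latin1'  => bin2hex($latin1),
        'utf8'    => bin2hex($utf8),
        'skipped' => bin2hex($skipped),
    ];
}

print_r(trace_repair("\xC3\x83\xC2\xA4"));   // input is a double-encoded "ä"
```

MySQL's latin1 is documented as being based on cp1252 rather than strict ISO-8859-1, so bytes in the 0x80 to 0x9F range may behave differently on the server than in this sketch; testing on a copy of the table first is the safer course.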

answered Oct 17 '22 at 15:10 by Hammerite


I tried the posted solutions, but my DB kept throwing errors. Eventually I stumbled upon the following solution (on a forum, I believe, though I can't remember where):

UPDATE table_name SET col_name = CONVERT(CONVERT(CONVERT(col_name USING latin1) USING binary) USING utf8);

and it worked a treat. I hope this helps anyone who ended up here after desperate Google searching, like me.

NOTE: This assumes that the double encoding came from an overly helpful MySQL conversion from latin1 to utf8, which I believe is where most of these "corrupted characters" originate. The statement performs the same conversion as described above: back to latin1, then to binary, then to utf8. The binary step keeps MySQL from re-encoding bytes that are already valid UTF-8 after the latin1 step.

answered Oct 17 '22 at 13:10 by Owen Waring