Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Convert latin1 characters on a UTF8 table into UTF8

Only today I realized that I was missing this in my PHP scripts:

mysql_set_charset('utf8');

All my tables are InnoDB, collation "utf8_unicode_ci", and all my VARCHAR columns are "utf8_unicode_ci" as well. I have mb_internal_encoding('UTF-8'); on my PHP scripts, and all my PHP files are encoded as UTF-8.

So, until now, every time I "INSERT" something with diacritics, example:

mysql_query('INSERT INTO `table` SET `name`="Jáuò Iñe"');

The 'name' contents would be, in this case: Jáuò Iñe.

Since I fixed the charset between PHP and MySQL, new INSERTs are now storing correctly. However, I want to fix all the older rows that are "messed" at the moment. I tried many things already, but it always breaks the strings on the first "illegal" character. Here is my current code:

$m = mysql_real_escape_string('¿<?php echo "¬<b>\'PHP &aacute; (á)ţăriîş </b>"; ?> ă-ţi abcdd;//;ñç´พดแทฝใจคçăâξβψδπλξξςαยนñ ;');
mysql_set_charset('utf8');
mysql_query('INSERT INTO `table` SET `name`="'.$m.'"');
mysql_set_charset('latin1');
mysql_query('INSERT INTO `table` SET `name`="'.$m.'"');
mysql_set_charset('utf8');

$result = mysql_iquery('SELECT * FROM `table`');
while ($row = mysql_fetch_assoc($result)) {
    $message = $row['name'];
    $message = mb_convert_encoding($message, 'ISO-8859-15', 'UTF-8');
    //$message = iconv("UTF-8", "ISO-8859-1//IGNORE", $message);
    mysql_iquery('UPDATE `table` SET `name`="'.mysql_real_escape_string($message).'" WHERE `a1`="'.$row['a1'].'"');
}

It "UPDATE"s with the expected characters, except that the string gets truncated after the character "ă". I mean, that character and following chars are not included on the string.

Also, testing with the "iconv()" (that is commented on the code) does the same, even with //IGNORE and //TRANSLIT

I also tested several charsets, between ISO-8859-1 and ISO-8859-15.

like image 564
Nuno Avatar asked Feb 23 '12 05:02

Nuno


People also ask

Does UTF-8 include Latin1?

UTF-8 is prepared for world domination, Latin1 isn't. If you're trying to store non-Latin characters like Chinese, Japanese, Hebrew, Russian, etc using Latin1 encoding, then they will end up as mojibake. You may find the introductory text of this article useful (and even more if you know a bit Java).

How do I change a character set from Latin1 to UTF-8 in MySQL?

Similarly, here's the command to change character set of MySQL table from latin1 to UTF8. Replace table_name with your database table name. mysql> ALTER TABLE table_name CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci; Hopefully, the above tutorial will help you change database character set to utf8mb4 (UTF-8).

What is encoding =' Latin1?

ISO 8859-1 is the ISO standard Latin-1 character set and encoding format. CP1252 is what Microsoft defined as the superset of ISO 8859-1. Thus, there are approximately 27 extra characters that are not included in the standard ISO 8859-1.


2 Answers

From what you describe, it seems you have UTF-8 data that was originally stored as Latin-1 and then not converted correctly to UTF-8. The data is recoverable; you'll need a MySQL function like

convert(cast(convert(name using  latin1) as binary) using utf8)

It's possible that you may need to omit the inner conversion, depending on how the data was altered during the encoding conversion.

like image 169
ABS Avatar answered Oct 22 '22 01:10

ABS


After I searched about an hour or two for this answer, I needed to migrate an old tt_news db from typo into a new typo3 version. I tried to convert the charset in the export file and import it back already, but didn't get it working.

Then I tried the answer above from ABS and started an update on the table:

UPDATE tt_news SET 
    title=convert(cast(convert(title using  latin1) as binary) using utf8), 
    short=convert(cast(convert(short using  latin1) as binary) using utf8), 
    bodytext=convert(cast(convert(bodytext using  latin1) as binary) using utf8)
WHERE 1

You can also convert imagecaption, imagealttext, imagetitletext and keywords if needed. Hope this will help somebody migrating tt_news to new typo3 version.

like image 45
Marcel Grolms Avatar answered Oct 22 '22 02:10

Marcel Grolms