Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Converting iso-8859-1 data to UTF-8 in UTF8 and Latin1 tables

Problem Summary:

While trying to convert a site with mysql database from latin1 to utf8, some special characters are not displaying correctly despite ensuring charsets are all utf8 system wide.

Problem Details:

This is a common problem. But I seem to have an added complexity.

Years ago, a oblivious developer (me), put a site together with MySQL. Some tables were setup with latin1_swedish_ci and utf8_general_ci. All input/display was done via pages with iso-8859-1 charset.

Now, I have the task of turning all this data into utf-8 and thus finally uniforming the encoding. However, i'm having issues with a number of special characters in both instances (ie: ü). The characters don't seem to display correctly on a UTF-8 page. They display as �.Instead When viewing the data in a utf8 table in mysql query browser, a correctly entered utf8'd 'u' displays as some special characters, while an incorrectly latin1 'u' displays as it should appear on page. But it doesn't.

I've tried a number of things:

  1. Percona script: https://github.com/rlowe/mysql_convert_charset
  2. converting col to binary and then to utf8
  3. converting utf8 tables to latin and then repeat above process

Nothing seems to cure the data.

Dumping the entire database and important isn't really a viable option as it's a huge database now and downtime is restricted.

UPDATE (22-Oct-2013)

I've taken @deceze suggestions and reviewed all my content encoding areas as per http://kunststube.net/frontback/. I did find a few places in which I was still passing/encoding data in latin1. So, i've now changed it all over to UTF-8. However, the data is still displaying incorrectly in a particular field. In a table which is in utf8 (no columns have implicit encoding), field1 is in latin1. I can confirm this by running the following which displays the text correctly:

select convert(cast(convert(field1 using latin1) as binary) using utf8) from my table WHERE id = 1

This will convert Hahnemühle to Hahnemühle.

In field2, it appears the data is in a different (unknown) encoding. The query above, when used on field2 converts Hahnem�hle to Hahnem�hle. I've gone through all the charsets on http://dev.mysql.com/doc/refman/5.5/en/charset-charsets.html replacing latin1 but none seem to spit out the data correctly.

like image 375
David Avatar asked Dec 16 '22 05:12

David


2 Answers

Setting a column to latin1 and others to utf8 is perfectly fine in MySQL. There's no problem to be solved here as such. This charset parameter just influences how the data is stored internally. Which of course also means that you cannot store, for example, "漢字" in a latin1 column. But assuming you're just storing "Latin-1 characters" in there, that's fine.

MySQL has something commonly called the connection encoding. It tells MySQL what encoding text is in that you send to it from PHP (or elsewhere), and what encoding you'd like back when retrieving data from MySQL. The column charset, the "input connection encoding" and "output connection encoding" can all be different things, MySQL will convert encodings on the fly accordingly as needed.

So, assuming you've used the correct connection encodings so far and data is stored properly in your database and you've not tried to store non-Latin-1 characters in Latin-1 columns, all you need to do to update your column charsets to UTF-8 is:

ALTER TABLE table MODIFY column TEXT [...] CHARACTER SET utf8;
like image 106
deceze Avatar answered Dec 19 '22 11:12

deceze


You can try mysqldump to convert from ISO-8859-1 to utf-8:

mysqldump --user=username --password=password --default-character-set=latin1 --skip-set-charset dbname > dump.sql
chgrep latin1 utf8 dump.sql (or when you prefer  sed -i "" 's/latin1/utf8/g' dump.sql) 
mysql --user=username --password=password --execute="DROP DATABASE dbname; CREATE DATABASE dbname CHARACTER SET utf8 COLLATE utf8_general_ci;"
mysql --user=username --password=password --default-character-set=utf8 dbname < dump.sql
like image 34
Micromega Avatar answered Dec 19 '22 09:12

Micromega