I am about to undertake the tedious and gotcha-laden task of converting a database from Latin1 to UTF-8. At this point I simply want to check what sort of data I have stored in my tables, as that will determine which approach I should use to convert the data.

Specifically, I want to check whether I have UTF-8 characters in the Latin1 columns. What would be the best way to do this? If only a few rows are affected, I can just fix them manually.

Option 1. Perform a MySQL dump and use Perl to search for UTF-8 characters?

Option 2. Use MySQL CHAR_LENGTH to find rows with multi-byte characters? e.g. <code>SELECT name FROM clients WHERE LENGTH(name) != CHAR_LENGTH(name);</code> Is this enough?

At the moment I have switched my MySQL client encoding to UTF-8.


How to detect UTF-8 characters in a Latin1 encoded column - MySQL

2 Answers

Character encoding, like time zones, is a constant source of problems.

What you can do is look for any "high-ASCII" characters, that is, bytes in the range 0x80 to 0xFF, as these are either LATIN1 accented characters or symbols, or part of a UTF-8 multi-byte character. Telling the difference isn't going to be easy unless you cheat a bit.
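The distinction between the two kinds of high bytes can be sketched outside the database. The following Python is only an illustration, assuming you already have the raw bytes of a value (for example from a dump); `classify` and `HIGH_BYTE` are hypothetical names, and Python's `latin-1` codec is used as a stand-in label for MySQL's latin1.

```python
import re

# Any byte in the "high-ASCII" range 0x80-0xFF.
HIGH_BYTE = re.compile(rb"[\x80-\xff]")


def classify(raw: bytes) -> str:
    """Make a rough guess at how the bytes of one value are encoded."""
    if not HIGH_BYTE.search(raw):
        return "ascii"  # no high bytes, identical in both encodings
    try:
        raw.decode("utf-8")  # strict: raises on malformed sequences
    except UnicodeDecodeError:
        return "latin1"  # high bytes that are not valid UTF-8
    return "utf8"  # high bytes that form valid UTF-8 sequences


print(classify(b"Bjorn"))                  # ascii
print(classify("Björn".encode("latin-1")))  # latin1
print(classify("Björn".encode("utf-8")))    # utf8
```

This is a heuristic, not a proof: genuine Latin-1 text that happens to contain a sequence such as "Ã¶" would also be reported as `utf8`, so a visual check of the flagged rows is still worthwhile.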

To figure out which encoding is correct, you can SELECT the same bytes interpreted both ways and compare the results visually. Here's an example:

SELECT CONVERT(CONVERT(name USING BINARY) USING latin1) AS latin1, 
       CONVERT(CONVERT(name USING BINARY) USING utf8) AS utf8 
FROM users 
WHERE CONVERT(name USING BINARY) RLIKE CONCAT('[', UNHEX('80'), '-', UNHEX('FF'), ']')

This is more complicated than it should be because the MySQL regexp engine seems to ignore escapes like \x80, so the byte range has to be built with UNHEX() instead.

This produces results like this:

latin1                utf8
----------------------------------------
BjÃ¶rn                Björn
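The same side-by-side comparison can be reproduced in Python to see why the two columns differ. This is a minimal sketch, assuming the stored bytes are the UTF-8 encoding of "Björn"; `both_views` is a hypothetical helper, not part of any MySQL client library.

```python
def both_views(raw: bytes) -> tuple[str, str]:
    """Interpret the same bytes as Latin-1 and as UTF-8."""
    as_latin1 = raw.decode("latin-1")  # never fails: every byte maps to a character
    as_utf8 = raw.decode("utf-8", errors="replace")  # malformed bytes become U+FFFD
    return as_latin1, as_utf8


raw = "Björn".encode("utf-8")  # b'Bj\xc3\xb6rn'
print(both_views(raw))         # ('BjÃ¶rn', 'Björn')
```

If the UTF-8 view shows replacement characters while the Latin-1 view looks right, the value was most likely stored as real Latin-1.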

answered Oct 20 '22 11:10

tadman

Since your question is not completely clear, let's assume some scenarios:

1. Hitherto wrong connection: You've been connecting to your database using the latin1 encoding, but have been sending it UTF-8 data (the encoding of the column is irrelevant in this case). This is the case I described here. It's easy to fix: dump the database contents to a file through a latin1 connection. This reverses the misinterpretation the same way your reads have done so far, so the dump file contains correct UTF-8 (read the aforelinked article for the gory details). You can then reimport the data through a correctly set utf8 connection, and it will be stored as it should be.
2. Hitherto wrong column encoding: UTF-8 data was inserted into a latin1 column through a utf8 connection. In that case forget it, the data is gone. Any character that has no latin1 equivalent will already have been replaced by a ?.
3. Hitherto everything fine, henceforth added support for UTF-8: You have Latin-1 data correctly stored in a latin1 column, inserted through a latin1 connection, but want to expand that to also allow UTF-8 data. In that case just change the column encoding to utf8. MySQL will convert the existing data for you. Then just make sure your database connection is set to utf8 when you insert UTF-8 data.
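The three scenarios can be illustrated with a toy model of what happens to the bytes. This is a simplified sketch and not how MySQL is implemented: `store` and `fetch` are hypothetical helpers, and Python's `latin-1` codec stands in for MySQL's latin1, which as far as I know treats bytes 0x80 to 0x9F differently.

```python
def store(client_bytes: bytes, connection: str, column: str) -> bytes:
    """Model an INSERT: interpret bytes per the connection, encode per the column."""
    text = client_bytes.decode(connection)
    return text.encode(column, errors="replace")  # unmappable characters become b"?"


def fetch(stored: bytes, column: str, connection: str) -> bytes:
    """Model a SELECT: interpret bytes per the column, encode per the connection."""
    text = stored.decode(column)
    return text.encode(connection, errors="replace")


utf8_input = "Björn".encode("utf-8")

# Scenario 1: UTF-8 sent over a latin1 connection. Whatever the column
# charset, reading back through latin1 returns the original UTF-8 bytes.
for column in ("latin-1", "utf-8"):
    stored = store(utf8_input, connection="latin-1", column=column)
    dumped = fetch(stored, column=column, connection="latin-1")
    assert dumped == utf8_input
    reimported = store(dumped, connection="utf-8", column="utf-8")
    assert reimported.decode("utf-8") == "Björn"

# Scenario 2: utf8 connection, latin1 column. Characters without a
# latin1 equivalent are replaced on the way in and cannot be recovered.
lost = store("Björn 日本".encode("utf-8"), connection="utf-8", column="latin-1")
print(lost)  # b'Bj\xf6rn ??'

# Scenario 3: correct latin1 data, column converted to utf8 afterwards.
latin1_stored = store("Björn".encode("latin-1"), connection="latin-1", column="latin-1")
converted = latin1_stored.decode("latin-1").encode("utf-8")
print(converted)  # b'Bj\xc3\xb6rn'
```

The model only tracks the two conversions (connection to column and back), which is enough to show why the first scenario is reversible and the second is not.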

answered Oct 20 '22 10:10

deceze


Tags: mysql, character-encoding, utf-8, latin1

dinie
