I am about to undertake the tedious and gotcha-laden task of converting a database from Latin1 to UTF-8.
At this point I simply want to check what sort of data I have stored in my tables, as that will determine what approach I should use to convert the data.
Specifically, I want to check whether I have UTF-8 characters in the Latin1 columns. What would be the best way to do this? If only a few rows are affected, I can just fix them manually.
Option 1. Perform a MySQL dump and use Perl to search for UTF-8 characters?
Option 2. Use MySQL CHAR_LENGTH to find rows with multi-byte characters?
e.g. SELECT name FROM clients WHERE LENGTH(name) != CHAR_LENGTH(name);
Is this enough?
At the moment I have switched my MySQL client encoding to UTF-8.
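On Option 2, here is a small sketch of why the length comparison may not be enough on its own. It is not MySQL code: the helper `mysql_lengths` is a hypothetical stand-in that mimics LENGTH() (bytes) and CHAR_LENGTH() (characters), and Python's "latin-1" codec is assumed to be close enough to MySQL's latin1 for this purpose. Because latin1 is a single-byte character set, the two numbers stay equal for a latin1 column even when the stored bytes are really UTF-8.

```python
def mysql_lengths(raw: bytes, column_charset: str) -> tuple:
    """Return (LENGTH, CHAR_LENGTH) as a column with this charset would see them."""
    byte_length = len(raw)                          # LENGTH() counts bytes
    char_length = len(raw.decode(column_charset))   # CHAR_LENGTH() counts characters
    return byte_length, char_length


# UTF-8 bytes for "Björn" that ended up in a latin1 column.
stored = "Björn".encode("utf-8")

latin1_lengths = mysql_lengths(stored, "latin-1")
utf8_lengths = mysql_lengths(stored, "utf-8")

print(latin1_lengths)  # equal numbers: one byte per character in latin1
print(utf8_lengths)    # the numbers differ only when the bytes are read as UTF-8
```

So a comparison of LENGTH and CHAR_LENGTH would probably only flag rows if the column bytes are first reinterpreted as UTF-8.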
These characters and symbols are part of a much larger encoding system called UTF-8, which can also represent every Latin1 character. Since WRDS' inception, all of our data has been stored in Latin1 encoding. As WRDS becomes much more global in scope and much more text-heavy, the need to move to UTF-8 encoding is apparent.
What is the difference between utf8 and latin1? They are different encodings. Only the ASCII characters map to the same byte sequences in both; accented letters share code points but are a single byte in Latin1 and two bytes in UTF-8. UTF-8 is one encoding of Unicode with all its code points; Latin1 encodes at most 256 characters.
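To make the byte-level difference concrete, here is a minimal sketch that prints the bytes each encoding produces for a few sample characters. The helper name `encoded_hex` and the sample characters are only illustrative.

```python
def encoded_hex(ch: str):
    """Return (utf8_hex, latin1_hex); latin1_hex is None if Latin1 cannot encode ch."""
    utf8_hex = ch.encode("utf-8").hex()
    try:
        latin1_hex = ch.encode("latin-1").hex()
    except UnicodeEncodeError:
        latin1_hex = None  # character is outside ISO-8859-1
    return utf8_hex, latin1_hex


for ch in ("A", "é", "€"):
    print(ch, encoded_hex(ch))
```

ASCII characters come out identical in both encodings, an accented letter is one byte in Latin1 but two in UTF-8, and a character such as the euro sign has no ISO-8859-1 encoding at all.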
You do that by calling str.valid_encoding? on a String str that is in UTF-8 encoding. Is that not clear from my answer? Programmatically, you cannot (or at least not easily, and certainly not reliably) check the invalidity of a string in a one-byte encoding such as CP1252.
Character encoding, like time zones, is a constant source of problems.
What you can do is look for any "high-ASCII" bytes (0x80 to 0xFF), as these are either Latin1 accented characters or symbols, or part of a UTF-8 multi-byte sequence. Telling the difference isn't going to be easy unless you cheat a bit.
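One possible way to "cheat", sketched outside of MySQL: bytes that contain high values and still decode cleanly as UTF-8 are very likely UTF-8, because ordinary Latin1 text rarely happens to form valid multi-byte sequences. This is a heuristic of my own choosing, not necessarily what the answer had in mind, and the function name `classify` is hypothetical.

```python
def classify(raw: bytes) -> str:
    """Guess how a column value is encoded: 'ascii', 'utf8', or 'latin1'."""
    if all(b < 0x80 for b in raw):
        return "ascii"      # no high bytes, identical in both encodings
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return "latin1"     # high bytes that are not valid UTF-8
    return "utf8"           # high bytes that form valid UTF-8 sequences


print(classify(b"Bjorn"))
print(classify("Björn".encode("utf-8")))
print(classify("Björn".encode("latin-1")))
```

A "latin1" verdict here only means "not valid UTF-8"; it cannot confirm that the bytes are meaningful Latin1, since every byte value is acceptable in a one-byte encoding.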
To figure out which encoding is correct, you just SELECT the same bytes interpreted two different ways and compare the results visually. Here's an example:
SELECT CONVERT(CONVERT(name USING BINARY) USING latin1) AS latin1,
       CONVERT(CONVERT(name USING BINARY) USING utf8) AS utf8
FROM users
WHERE CONVERT(name USING BINARY) RLIKE CONCAT('[', UNHEX('80'), '-', UNHEX('FF'), ']');
This is made unusually complicated because the MySQL regexp engine seems to ignore things like \x80, which makes it necessary to use the UNHEX() method instead.
This produces results like this:
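latin1                utf8
----------------------------------------
BjÃ¶rn                Björn

Here the utf8 column reads correctly and the latin1 column shows mojibake, so the stored bytes for this row are UTF-8.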
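The same two-column comparison can be reproduced outside the database. This is a rough sketch, assuming the raw column bytes are available as a Python `bytes` value; `two_views` is a hypothetical helper, not part of any MySQL client library.

```python
def two_views(raw: bytes):
    """Interpret the same bytes as Latin1 and as UTF-8, like the two SELECT columns."""
    as_latin1 = raw.decode("latin-1")                 # never fails: every byte is a character
    as_utf8 = raw.decode("utf-8", errors="replace")   # invalid bytes become U+FFFD
    return as_latin1, as_utf8


# Bytes of a UTF-8 encoded name sitting in a latin1 column.
row = two_views("Björn".encode("utf-8"))
print(row)  # ('BjÃ¶rn', 'Björn')
```

Whichever view reads naturally indicates the real encoding of that row. For genuine Latin1 bytes the situation is reversed: the Latin1 view looks right and the UTF-8 view contains replacement characters.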
Since your question is not completely clear, let's assume some scenarios: