Converting MySQL table with incorrectly encoded data to UTF-8

I've got a big ol' MySQL 5.1 database, and for a variety of stupid reasons, I've been storing, I believe, UTF8 characters encoded as LATIN1 in a UTF8 table. It's... strange. And I'd like to fix it.

The solution from the question "MySQL - Convert latin1 characters on a UTF8 table into UTF8" seems to work -- a column at a time. But I have 24 tables and dozens of columns to convert. I'm really looking for a solution that'll convert at least a whole table at once.

For reference, the single-column solution that works for me is:

UPDATE foo SET col1 = CONVERT(CAST(CONVERT(col1 USING latin1) AS binary) USING utf8);
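
To see why the nested CONVERT works, here is a minimal Ruby sketch of the same byte-level round trip outside MySQL. It assumes Ruby's Windows-1252 encoding is a close enough stand-in for MySQL's latin1, and the helper names double_encode and repair are made up for illustration.

```ruby
# Simulate the damage: UTF-8 bytes are misread as latin1 (cp1252 here) and
# then stored again as UTF-8, so each non-ASCII character grows into several.
def double_encode(text)
  text.dup.force_encoding("Windows-1252").encode("UTF-8")
end

# Mirror CONVERT(CAST(CONVERT(col USING latin1) AS binary) USING utf8):
# encode back to latin1 bytes, then relabel those bytes as UTF-8.
def repair(text)
  text.encode("Windows-1252").force_encoding("UTF-8")
end

original = "caf\u00e9"               # "café"
damaged  = double_encode(original)   # what ends up stored in the table
fixed    = repair(damaged)

puts damaged.bytes.length            # 7 bytes instead of the original 5
puts fixed == original               # true
```

The inner conversion recovers the original bytes and the outer one only changes the label on them, which is why the binary cast in the middle matters. Ruby's Windows-1252 leaves a handful of bytes (0x81, 0x8D and a few others) undefined, so this sketch may raise on some characters that MySQL's latin1 would handle.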

For tables, I can do:

ALTER TABLE foo CONVERT TO CHARACTER SET latin1;
ALTER TABLE foo CONVERT TO CHARACTER SET binary;
ALTER TABLE foo CHARACTER SET utf8 COLLATE utf8_unicode_ci;

which gets me very close -- however, the CONVERT TO CHARACTER SET binary step turns all my VARCHAR columns into VARBINARY and my TEXT columns into BLOBs in one fell swoop. I can go through and change them back and all appears to be well... but then I'm back in the "let's modify all the columns individually" world -- in which case, I may just as well

I've tried about 50 variations on those SQL statements, but I can't find one that both leaves my columns in character data types and encodes the data properly.

Any suggestions?

Update: Deciding to just fix the columns rather than waiting for a database- or table-wise solution, I came up with:

#!/usr/bin/env ruby
require 'rubygems'
require 'mysql2'

CONNECT_OPTS = {} # whatever you want
Mysql2::Client.default_query_options.merge!(:as => :array)
conn = Mysql2::Client.new(CONNECT_OPTS)

tables = conn.query("SHOW TABLES").map {|row| row[0] }

# See http://dev.mysql.com/doc/refman/5.0/en/charset-column.html
# One might want to include enum and set columns; I don't have them
TYPES_TO_CONVERT = %w(char varchar text)
tables.each do |table|
  puts "converting #{table}"
  # Get all the columns and we'll filter for the ones we want
  columns = conn.query("DESCRIBE `#{table}`")
  columns_to_convert = columns.find_all {|row|
    TYPES_TO_CONVERT.include? row[1].gsub(/\(\d+\)/, '')
  }.map {|row| row[0]}
  next if columns_to_convert.empty?

  query = "UPDATE `#{table}` SET "
  query += columns_to_convert.map {|col|
    "`#{col}` = convert(cast(convert(`#{col}` using latin1) as binary) using utf8)"
  }.join ", "
  puts query
  conn.query query
end
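
The column filtering and query building in that script can also be pulled into pure functions and checked without a database connection. This is only a sketch: the names TEXT_TYPES, columns_to_repair and repair_query are hypothetical, and the tinytext, mediumtext and longtext entries are an assumed extension of the original type list.

```ruby
# Column types to repair. tinytext/mediumtext/longtext are not in the
# original list; they are added here as an assumption.
TEXT_TYPES = %w(char varchar tinytext text mediumtext longtext).freeze

# Pick column names whose declared type, minus any "(n)" length, is textual.
# Each row is [field, type, ...], matching DESCRIBE output read as arrays.
def columns_to_repair(describe_rows)
  describe_rows
    .select { |row| TEXT_TYPES.include?(row[1].downcase.sub(/\(\d+\)/, "")) }
    .map { |row| row[0] }
end

# Build one UPDATE that repairs every listed column, or nil if there are none.
def repair_query(table, columns)
  return nil if columns.empty?
  assignments = columns.map do |col|
    "`#{col}` = CONVERT(CAST(CONVERT(`#{col}` USING latin1) AS BINARY) USING utf8)"
  end
  "UPDATE `#{table}` SET #{assignments.join(', ')}"
end

# Made-up DESCRIBE rows for a table named foo.
sample_rows = [
  ["id",    "int(11)",      "NO",  "PRI", nil, "auto_increment"],
  ["title", "varchar(255)", "YES", "",    nil, ""],
  ["body",  "mediumtext",   "YES", "",    nil, ""]
]
puts repair_query("foo", columns_to_repair(sample_rows))
```

Printing the generated statements first, before running any of them, makes it easy to review exactly which columns will be touched.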

... which gets the job done. Amusingly, this runs on my database in 36 seconds, compared with the ALTER TABLE route, which took 13 minutes (and had the VARBINARY problem), or the mysqldump solutions, which would take upwards of twenty minutes, assuming I could get them to run.

I'll still accept an answer if someone knows an elegant way to do this for a whole database or table in one step.

Asked by Nate on Jun 11 '13 at 17:06


1 Answer

The method below looks really promising and, better yet, is beautiful in its simplicity. The idea is that you mysqldump your entire database as latin1, then import it re-encoded as UTF-8.

Export:

mysqldump -u [user] -p --opt --quote-names --skip-set-charset --default-character-set=latin1 [database] > dump.sql

Import:

mysql -u [user] -p --default-character-set=utf8 [database] < dump.sql

I take no credit for this solution; it's entirely from Gareth Price's blog. It has worked for everyone who has left him a comment so far. One comment in particular caught my attention: "Wow man you just saved my life. I did not spent 2 hours on this, but 2 days".

Update #1: Looks like Gareth wasn't the first to discover this.

Update #2: I just tried this & it worked beautifully for my UTF8-stored-as-latin1 database. Just make sure you switch the default charset on your database to utf8 before importing, or else you'll end up with plain question marks where the special characters were. Of course this might have plenty of other ramifications so test like hell first.

ALTER SCHEMA [database] DEFAULT CHARACTER SET utf8;

And if you have any tables that aren't set to the schema default:

ALTER TABLE [table] CHARACTER SET = DEFAULT;

(Same idea if you have any column-specific charset settings: you'll have to do an ALTER TABLE [table] CHANGE COLUMN [settings] without specifying CHARACTER SET, so the column goes back to the table default.)

Answered by Wick on Oct 23 '22 at 08:10