Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

The dreaded MySQL import encoding issue - revisited

I'm having the standard MySQL import encoding issue, but I can't seem to solve it.

My client has had a WordPress installation running for some time. I've dumped the database to a file, and imported it locally. The resulting pages have a splattering of � characters throughout.

I've inspected the database properties on both sides: production: show create database wordpress;

CREATE DATABASE `wordpress` /*!40100 DEFAULT CHARACTER SET latin1 */

local: show create database wordpress;

CREATE DATABASE `wordpress` /*!40100 DEFAULT CHARACTER SET latin1 */

production: show create table wp_posts;

CREATE TABLE `wp_posts` (
  `ID` bigint(20) unsigned NOT NULL auto_increment,
  ...
  KEY `post_date_gmt` (`post_date_gmt`)
) ENGINE=MyISAM AUTO_INCREMENT=7932 DEFAULT CHARSET=utf8

local: show create table wp_posts;

CREATE TABLE `wp_posts` (
  `ID` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  ...
  KEY `post_date_gmt` (`post_date_gmt`)
) ENGINE=MyISAM AUTO_INCREMENT=7918 DEFAULT CHARSET=utf8

I've spent hours reading forums on how to squash the �, but I can't get anything to work. 99% of the answers say to match the character set between the databases. What I think should work if the following:

mysqldump --opt --compress --default-character-set=latin1 -uusername -ppassword wordpress | ssh [email protected] mysql --default-character-set=latin1 -uusername -ppassword wordpress

I've done it using the utf8 char-set as well. Still with the �'s.

I've tried modifying the SQL dump directly, putting with utf8 or latin1 in the "SET names UTF8" line. Still with the �'s.

Strange Symptoms

I'd expect these � characters to appear in place of special characters in the content, like ñ or ö, but I've seen it where there would normally be just a space. I've also seen it in place of apostrophes (but not all apostrophes), double quotes, and trademark symbols.

The � marks are pretty rare. They appear on average three to four times per page.

I don't see any �'s when viewing the database through Sequel Pro (locally or live). I don't see any �'s in the SQL when viewing through Textmate.

What am I missing?

EDIT

More info:

I've tried to determine what the live database thinks the encoding is. I ran show table status, and it seems that the Collations are a mix of utf8_general_ci,utf8_binandlatin1_swedish_ci`. What are they different? Does it matter?

I also ran: show variables like "character_set_database" and got latin1;

like image 754
Matt Avatar asked May 12 '11 21:05

Matt


1 Answers

This is how I ended up solving my problem:

First mysqldump -uusername -ppassword --default-character-set=latin1 database -r dump.sql

Then run this script:

$search = array('/latin1/');
$replace = array('utf8');
foreach (range(128, 255) as $dec) {
    $search[] = "/\x".dechex($dec)."/";
    $replace[] = "&#$dec;";
}

$input = fopen('dump.sql', 'r');
$output = fopen('result.sql', 'w');

while (!feof($input)) {
    $line = fgets($input);
    $line = preg_replace($search, $replace, $line);
    fwrite($output, $line);
}

fclose($input);
fclose($output);

The script finds all the hex characters above 127 and encoded them into their HTML entities.

Then mysql -uusername -ppassword database < result.sql

like image 83
Matt Avatar answered Oct 19 '22 02:10

Matt