Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

UTF-8 strings in a MySQL database got messed up after configuration change

Tags:

php

mysql

utf-8

I have a MySQL with strings that I left dormant for a while. Now that I picked it up again, I noticed that all the special characters are screwed up. My ISP has ported the server to a different machine, I suspect that this might be when it happened.

The database was populated by a PHP script. Everything was supposed to be in UTF-8, that's what the database is set to.

However, this is what a string looks like now:

fête

Those four special characters are supposed to be one character, ê, the string is meant to be fête.

Now it looks like this is just re-encoded twice, but that doesn't seem right. Those four characters in hex are:

C3 83 C6 92 C3 82 C2 AA

This looks very much like UTF-8, so if we decode it, we get

C3 3F C2 AA

This isn't quite UTF-8 (because of the 3F), but let's decode it again:

FF AA

This is not UTF-8.

The ê character is EA, in UTF-8, that would be C3 AA.

Another example: The Spanish upside-down question mark (¿) is there as C8 83 E2 80 9A C3 82 C2, which decodes to C3 3F 82 BF, which isn't proper UTF-8 again (translates to FF 82 BF). The expected character for ¿ is BF, i.e. C2 BF in proper UTF-8.

What happened here? How did the characters get messed up? More importantly, how do I fix it?

(Side note - the new server requires me to write mysql_set_charset("utf8"); or else strings get messed up too, although in the "UTF-8 as latin1" fashion, not in this weird fashion as seen above.)

TL;DR:

  • MySQL database was populated in UTF-8 through PHP script
  • Lay dormant for years, server got migrated.
  • Now characters are messed up, see above.
like image 661
EboMike Avatar asked Jan 19 '23 11:01

EboMike


2 Answers

C3 83 C6 92 C3 82 C2 AA

This looks very much like UTF-8, so if we decode it, we get

C3 3F C2 AA

That's what you get if you treat the sequence of bytes as UTF-8, then encode it as ISO-8859-1. 3F is ?, which has been included as a replacement character, because UTF-8 C6 92 is U+0192 ƒ which does not exist in ISO-8859-1. But it does exist in Windows code page 1252 Western European, an encoding very similar to ISO-8859-1; there, it's byte 0x83.

C3 83 C2 AA

Go through another round of treat-as-UTF-8-bytes-and-encode-to-cp1252 and you get:

C3 AA

which is, finally, UTF-8 for ê.

Note that even if you serve a non-XML HTML page explicitly as ISO-8859-1, browsers will actually use the cp1252 encoding, due to nasty historical reasons.

Unfortunately MySQL doesn't have a cp1252 encoding; latin1 is (correctly) ISO-8859-1. So you won't be able to fix up the data by dumping as latin1 then reloading as utf8 (twice). You'd have to process the script with a text editor that can save as either (or eg in Python file(path, 'rb').read().decode('utf-8').encode('cp1252').decode('utf-8').encode('cp1252')).

like image 90
bobince Avatar answered Jan 21 '23 01:01

bobince


I suspect you might have your characters stored as UTF8 strings in a latin1 (or similar) database. That's why you've got 'double encoding' problem. Making the database's CHARSET UTF8 should fix it. Dumping/importing the data might be necessary as well, something along those lines:

$ mysqldump --default-character-set=latin1  --skip-set-charset --databases xxx > xxx.sql
$ mysql --default-character-set=utf8 < xxx.sql

But that's just suggestion, might work but don't have to in your specific case.

like image 32
Paweł Gościcki Avatar answered Jan 21 '23 01:01

Paweł Gościcki