Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

UTF-8 data in Latin1 database: can it be saved?

I have a rails app that receives data from an Android device. I noticed that some of the data, when in Japanese, is not saved correctly. It shows up as literal question marks (not the diamond ones) in the MySQL client and in the rails website.

It turns out that the database that I have connected to the rails app is set to Latin1. Rails is set to UTF-8.

I read a lot about character encodings, but they all mention that the data is somehow a bit readable. Mine however is only literal question marks. Also trying to convert the data to UTF-8 using several methods on the web doesn't change a thing. I suspect that the data is converted to question marks when it's written to the database.

Sample output from the MySQL console:

select * from foo where bar = "foobar";
+-------+------+------------------------+---------------------+---------------------+
| id    | name | bar                    | created_at          | updated_at          |
+-------+------+------------------------+---------------------+---------------------+
| 24300 | ???? | foobar                 | 2012-01-23 05:04:22 | 2012-01-23 05:04:22 |
+-------+------+------------------------+---------------------+---------------------+
1 row in set (0.00 sec)

The input data, that my rails app got from the Android client was:

name = 爆笑笑話

This input data has been verified to exist in the rails app before saving to the database. So it's not mangled in the Android client or during transfer to the server. Is there any chance I can get this data back? Or is it completely lost?

like image 917
Peterdk Avatar asked Dec 22 '12 00:12

Peterdk


People also ask

What is the difference between UTF-8 and Latin1?

what is the difference between utf8 and latin1? They are different encodings (with some characters mapped to common byte sequences, e.g. the ASCII characters and many accented letters). UTF-8 is one encoding of Unicode with all its codepoints; Latin1 encodes less than 256 characters.

Is UTF-8 a superset of Latin1?

Set of characters representable by UTF-8 is a superset of the set of characters representable by Latin1.

Why do we use encoding in Latin1?

It is used by most Unix systems as well as Windows. DOS and Mac OS, however, use their own sets. Latin-1 is occasionally, though imprecisely, referred to as Extended ASCII. This is because the first 128 characters of its set are identical to the US ASCII standard.


1 Answers

It's actually very easy to think that data is encoded in one way, when it is actually encoded in some other way: this is because any attempt to directly retrieve the data will result in conversion first to the character set of your database connection and then to the character set of your output medium—therefore you should first verify the actual encoding of your stored data through either SELECT BINARY name FROM foo WHERE bar = 'foobar' or SELECT HEX(name) FROM foo WHERE bar = 'foobar'.

Where the character is expected, you will likely find either of the following byte sequences:

  • 0xe78886, indicating that your column actually contains UTF-8 encoded data: this usually happens when the character set of the database connection over which the text was originally inserted was set to latin1 but actually UTF-8 encoded data was sent.

    You must be seeing ? characters when fetching the data because something between the data storage and the display has been unable to transcode those bytes (however, given that MySQL thinks they represent 爆 and those characters are likely available in most character sets, it's unlikely that it's occurring within MySQL itself—unless you're explicitly adjusting the encoding information during retrieval).

    Anyway, if this is the case, you need to drop the encoding information from the column and then tell MySQL that the data is actually encoded as UTF-8. As documented under ALTER TABLE Syntax:

    Warning 

    The CONVERT TO operation converts column values between the character sets. This is not what you want if you have a column in one character set (like latin1) but the stored values actually use some other, incompatible character set (like utf8). In this case, you have to do the following for each such column:

    ALTER TABLE t1 CHANGE c1 c1 BLOB;
    ALTER TABLE t1 CHANGE c1 c1 TEXT CHARACTER SET utf8;
    

    The reason this works is that there is no conversion when you convert to or from BLOB columns.

  • 0x3f, indicating that the database does actually contain the literal character ? and your original data has been lost: this doesn't happen easily, since MySQL usually throws error 1366 if implicit transcoding results in loss of data. Perhaps there was some explicit transcoding in your insert statement?

    In this case, you need to convert the storage encoding to a suitable format, then update or re-insert the data:

    ALTER TABLE foo CONVERT TO utf8;
    UPDATE foo SET name = _utf8 '爆笑笑話' WHERE bar = 'foobar';
    
like image 184
eggyal Avatar answered Oct 15 '22 09:10

eggyal