Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Explanation of how charset affects storage/display

I find it infuriating that I don't understand this yet, but maybe some explanation will help. This is a two part question, but hopefully both parts are small and directly related:

Display

We recently had an issue where content was inserting U+00a0 (non-breaking space) characters into a DB column with the latin1 charset. Simply doing SELECT prints out "Â" in the column. I'm not sure if this is a product of selecting or of the display, but I believe it is the former. SELECT BINARY col instead prints out " ", as well it should since my shell has $LANG = en_US.utf8.

A more visible example is "â„¢" vs. "™"

Using SELECT CONVERT(col USING utf8) still prints out "Â" and "â„¢" -- I wouldn't necessarily expect it to do differently, but where does the problem stem from? Is it a problem that occurs at storage time? Is there a way to get the UTF8 display out of the DB instead of relying on the UI to display it properly (if that makes sense?)

Storage

In an attempt to reproduce this problem myself, I did the following:

CREATE TABLE chrs (
    lat varchar(255) charset latin1,
    utf varchar(255) charset utf8
);
INSERT INTO chrs VALUES ('™', '™');
INSERT INTO chrs VALUES (' ', ' '); -- U+00a0

However, this results in:

> SELECT * FROM chrs;
+------+------+
| lat  | utf  |
+------+------+
| ™    | ™    |
|      |      |
+------+------+

I would expect lat to display "Â" and "â„¢," so there is clearly something I do not understand.

What's more is this:

 > SELECT BINARY lat, BINARY utf FROM chrs;
+------------+------------+
| BINARY lat | BINARY utf |
+------------+------------+
| �           | ™          |
| �           |            |
+------------+------------+

This indicates that the values are being stored improperly (?) into lat.

I noticed that SELECT @@character_set_client was utf8, so I changed it to latin1 and did the space insert again, but this yields

|     |     |

for both columns. SELECT BINARY lat displays the space correctly, but SELECT binary utf8 still prints out "Â." I would expect the utf8 column to work more correctly.

To summarize:

  • What does MySQL actually do to characters when you insert them? Does it depend on the column charset, the client set, both, or something else?
  • Is it possible to screw up data at insert time due to a mismatch of the above? Or is it always possible to recover the originally inserted data?
  • What does the charset on the column actually do with respect to storage/display?
like image 301
Explosion Pills Avatar asked Oct 04 '22 07:10

Explosion Pills


1 Answers

In short, your database seems to be OK, except when you explicitly tell it to behave strangely by changing [@@character_set_client] from [utf8] to [latin1]. Otherwise, I think you're seeing the effects of disagreements elsewhere between software components that use UTF-8 vs. Windows-1252.

How do we understand what's happening?

For a start, we recall that in MySQL latin1 really means Windows-1252, an encoding a little different from "Latin-1" proper which is also known as ISO/IEC 8859-1.

Now let's consider the following data concerning the trademark sign and the non-breaking space:

  • Character: "trade mark sign"
  • Unicode point: U+2122
  • UTF-8 hexadecimal bytes: E2 84 A2
  • Latin-1 (ISO 8859-1) hexadecimal byte: there is no code for this character in this encoding
  • Windows 1252 hexadecimal byte: 8D

  • Character: "non-breaking space"

  • Unicode point: U+00A0
  • UTF-8 hexadecimal bytes: C2 A0
  • Latin-1 (ISO 8859-1) hexadecimal byte: A0
  • Windows 1252 hexadecimal byte: A0

Various ways that things go wrong:

  • Characters resulting from interpreting trademark sign UTF-8 hexadecimal bytes as Windows 1252 bytes: â „ ¢
    • "latin small letter a with circumflex", "double low-9 quotation mark", "cent sign"
    • Note: Latin-1 and Unicode do not have a decoding at all for the hexadecimal byte 84 that Windows-1252 defines as "double low-9 quotation mark". Unicode encodes "double low-9 quotation mark" at a code point distant from there, U+201E.
  • Characters resulting from interpreting non-breaking space UTF-8 hexadecimal bytes as Windows 1252 bytes: Â [non-breaking space]
    • "latin large letter a with circumflex", "non-breaking space"
  • Characters resulting from interpreting trademark sign Windows-1252 hexadecimal byte as UTF-8 bytes: [no character: the platform's missing-character mark displays, usually a variation on a question-mark sign]

It appears that when you insert, your database stores the trademark sign in "latin1" as the hexadecimal byte 8D and in "UTF-8" as the hexadecimal bytes E2 A4 A2. It stores the non-breaking space in "latin1" as the hexadecimal byte "A0" and in UTF-8 as the hexadecimal bytes C2 A0. When you do a normal SELECT interactively, the "latin1" trademark sign is translated first into the Unicode point U+2122 and then into UTF-8 hexadecimal bytes E2 84 A2, which finally can be misinterpreted as if they were Windows-1252 bytes.

Where to find the data shown above about characters:

  • http://www.fileformat.info/info/unicode/char/2122/index.htm
  • http://en.wikipedia.org/wiki/ISO/IEC_8859-1
  • http://en.wikipedia.org/wiki/Windows-1252
  • Windows-1252 correlation with Unicode
  • http://www.fileformat.info/info/unicode/char/201e/index.htm
like image 110
minopret Avatar answered Oct 13 '22 10:10

minopret