I find it infuriating that I don't understand this yet, but maybe some explanation will help. This is a two part question, but hopefully both parts are small and directly related:
We recently had an issue where content was inserting U+00a0
(non-breaking space) characters into a DB column with the latin1
charset. Simply doing SELECT
prints out "Â" in the column. I'm not sure if this is a product of selecting or of the display, but I believe it is the former. SELECT BINARY col
instead prints out " ", as well it should since my shell has $LANG = en_US.utf8
.
A more visible example is "â„¢" vs. "™"
Using SELECT CONVERT(col USING utf8)
still prints out "Â" and "â„¢" -- I wouldn't necessarily expect it to do differently, but where does the problem stem from? Is it a problem that occurs at storage time? Is there a way to get the UTF8 display out of the DB instead of relying on the UI to display it properly (if that makes sense?)
In an attempt to reproduce this problem myself, I did the following:
CREATE TABLE chrs (
lat varchar(255) charset latin1,
utf varchar(255) charset utf8
);
INSERT INTO chrs VALUES ('™', '™');
INSERT INTO chrs VALUES (' ', ' '); -- U+00a0
However, this results in:
> SELECT * FROM chrs;
+------+------+
| lat | utf |
+------+------+
| ™ | ™ |
| | |
+------+------+
I would expect lat
to display "Â" and "â„¢," so there is clearly something I do not understand.
What's more is this:
> SELECT BINARY lat, BINARY utf FROM chrs;
+------------+------------+
| BINARY lat | BINARY utf |
+------------+------------+
| � | ™ |
| � | |
+------------+------------+
This indicates that the values are being stored improperly (?) into lat
.
I noticed that SELECT @@character_set_client
was utf8
, so I changed it to latin1
and did the space insert again, but this yields
| Â | Â |
for both columns. SELECT BINARY lat
displays the space correctly, but SELECT binary utf8
still prints out "Â." I would expect the utf8
column to work more correctly.
charset
on the column actually do with respect to storage/display?In short, your database seems to be OK, except when you explicitly tell it to behave strangely by changing [@@character_set_client] from [utf8] to [latin1]. Otherwise, I think you're seeing the effects of disagreements elsewhere between software components that use UTF-8 vs. Windows-1252.
How do we understand what's happening?
For a start, we recall that in MySQL latin1 really means Windows-1252, an encoding a little different from "Latin-1" proper which is also known as ISO/IEC 8859-1.
Now let's consider the following data concerning the trademark sign and the non-breaking space:
Windows 1252 hexadecimal byte: 8D
Character: "non-breaking space"
Various ways that things go wrong:
It appears that when you insert, your database stores the trademark sign in "latin1" as the hexadecimal byte 8D and in "UTF-8" as the hexadecimal bytes E2 A4 A2. It stores the non-breaking space in "latin1" as the hexadecimal byte "A0" and in UTF-8 as the hexadecimal bytes C2 A0. When you do a normal SELECT interactively, the "latin1" trademark sign is translated first into the Unicode point U+2122 and then into UTF-8 hexadecimal bytes E2 84 A2, which finally can be misinterpreted as if they were Windows-1252 bytes.
Where to find the data shown above about characters:
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With