How to set up MySQL to handle Unicode diacritics properly?

Question

This is an odd puzzle. AFAIK utf8_bin should guarantee that every accent is stored in the database properly, i.e. without some strange conversion to ASCII. So I have a table declared with:

DEFAULT CHARSET=utf8 COLLATE=utf8_bin

and yet when I compare or query entries such as "Krąków" and "Kraków", MySQL treats them as the same string.

Out of curiosity I also tried utf8_polish_ci, and MySQL claims that for Polish speakers "a" and "ą" do not make any difference.

So how do I set up a MySQL table so that I can store Unicode strings safely, without losing accents and the like?

Server: MySQL 5.5 + openSUSE 11.4, client: Windows 7 + MySQL Workbench 5.2.

Update -- CREATE TABLE

CREATE TABLE `Cities` (
  `city_Name` VARCHAR(145) CHARACTER SET utf8 NOT NULL,
  PRIMARY KEY (`city_Name`)
) DEFAULT CHARSET=utf8 COLLATE=utf8_bin;

Please note that I cannot set utf8_bin separately on the column: because the entire table is utf8_bin, the column's collation is in effect reset to "default".

greenoldman · Accepted Answer

All credit for the solution goes to bobince, so please upvote his comment on my question.

The solution to the problem is somewhat strange, and I would risk saying MySQL is broken in this regard.

So, let's say I created a table with utf8 and set nothing for the column. Later I realize I need strict comparison of characters, so I change the collation of the table AND the column to utf8_bin. Solved?

No. MySQL now sees that the table is utf8_bin and the column is also utf8_bin, which means the column uses the DEFAULT collation of the table. However, MySQL does not realize that the previous default is not the same as the current default, so the comparison still does not work.

So you have to shake the column off that default by setting it to some alien value outside the collation "family" (for "utf8xxx" that means anything that is not another "utf8xxx"). Once that is done and the column's collation entry no longer says "default", you can set utf8_bin. It again evaluates to the default, but since we are coming from a non-default collation, everything kicks in as expected.

Do not forget to apply the changes at each step.
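The same outcome can probably be reached without the detour by naming the collation explicitly on the column in SQL. This is only a sketch and not the Workbench procedure described above: it assumes the Cities table from the question, and that the column was silently using some collation other than utf8_bin.

```sql
-- State both the character set and the collation explicitly on the column.
-- NOT NULL is repeated because MODIFY COLUMN restates the whole definition.
-- Assumption: moving from a _ci collation to utf8_bin is stricter, so existing
-- primary-key values should not collide.
ALTER TABLE Cities
  MODIFY COLUMN city_Name VARCHAR(145)
    CHARACTER SET utf8 COLLATE utf8_bin NOT NULL;

-- Confirm the result: the Collation field for city_Name should read utf8_bin.
SHOW FULL COLUMNS FROM Cities;
```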

O. Jones · Answer

The MySQL default charset and collation (which are server-wide but can be changed per connection) apply at the time a table is created. Changing the defaults after the table is created doesn't affect existing tables.
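To see which defaults are currently in force, the server and connection settings can be listed. A minimal sketch, assuming the Cities table from the question; the variable names are standard MySQL system variables.

```sql
-- Server, database and connection character sets.
SHOW VARIABLES LIKE 'character_set%';

-- Server, database and connection collations.
SHOW VARIABLES LIKE 'collation%';

-- What was actually recorded for the table when it was created or last altered.
SHOW CREATE TABLE Cities;
```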

Character sets and collations are attributes of individual columns. They can be set from a table-wide default but they do belong to columns.
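One documented rule is worth keeping in mind here: when a column definition says CHARACTER SET x without a COLLATE clause, the column gets the default collation of x (utf8_general_ci for utf8), not the table's collation. The city_Name column in the question is declared that way, so this may well be what is happening. The queries below are a sketch for checking it, assuming the table lives in the currently selected database.

```sql
-- Collation actually attached to each column of the table.
SELECT COLUMN_NAME, CHARACTER_SET_NAME, COLLATION_NAME
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = DATABASE()
  AND TABLE_NAME = 'Cities';

-- Table-wide default, which only applies to columns that specify nothing.
SELECT TABLE_NAME, TABLE_COLLATION
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = DATABASE()
  AND TABLE_NAME = 'Cities';
```

If the two collations differ, comparisons on the column follow the column's collation.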

A charset of utf8 should be sufficient to allow all European languages to be represented correctly. You should definitely be able to store "a" and "ą" as two different characters.

A collation of utf8_bin is sensitive to both case and accents.

Here are some examples of the difference between text value and collation behavior. I'm using three sample strings: 'abcd', 'ĄBCD' , and 'ąbcd'. The last two have the A-ogonek letter.

This first example shows that with the utf8 character set and the utf8_general_ci collation, the strings display as entered but compare equal. That's to be expected in a collation that doesn't distinguish between a and ą. It is a typical case-insensitive collation, in which all the variant characters sort equal to the character without any diacritical marks.

SET NAMES 'utf8' COLLATE 'utf8_general_ci';
SELECT 'abcd', 'ąbcd' , 'abcd' < 'ąbcd',  'abcd' = 'ąbcd';
                               false            true

This next example shows that in the case-insensitive Polish-language collation, a comes before ą. I don't know Polish, but I suspect Polish telephone books have the As and the Ą's separated.

SET NAMES 'utf8' COLLATE 'utf8_polish_ci';
SELECT 'abcd', 'ĄBCD' , 'ąbcd', 'abcd' < 'ĄBCD', 'abcd' < 'ąbcd' , 'ąbcd' = 'ĄBCD' 
                                      true             true              true

This next example shows what happens with the utf8_bin collation.

SET NAMES 'utf8' COLLATE 'utf8_bin';
SELECT 'abcd', 'ĄBCD' , 'ąbcd', 'abcd' < 'ĄBCD', 'abcd' < 'ąbcd' , 'ąbcd' = 'ĄBCD' 
                                      true           true               false

There's one non-intuitive thing to notice in this case. 'abcd' < 'ĄBCD' is true (whereas 'abcd' < 'ABCD' with pure ASCII is false). That's a strange result if you're thinking linguistically. It happens because both A-ogonek characters have binary values in utf8 that are higher than all the abc and ABC characters. So: if you use the utf8_bin collation for ORDER BY operations, you'll get linguistically strange results.
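If the column is stored as utf8_bin for strict comparisons, a linguistic sort order can still be requested per query with a COLLATE clause. A minimal sketch, assuming the Cities table from the question and a utf8 column:

```sql
-- Equality on the column stays strict (utf8_bin), while this one query
-- sorts using the Polish collation instead of raw byte values.
SELECT city_Name
FROM Cities
ORDER BY city_Name COLLATE utf8_polish_ci;
```

The trade-off is that an index on the utf8_bin column generally cannot be used to produce this ordering, so the server has to sort the rows itself.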

You're saying that 'Krąków' and 'Kraków' compare equal, and that you're puzzled by that. They do compare equal when the collation in use is utf8_general_ci. But they don't with either utf8_bin or utf8_polish_ci. According to the Polish-language support in MySQL, these two spellings of the city's name are different.
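The three cases can be checked side by side by attaching a collation to the comparison itself. This is a sketch; the column aliases are made up for illustration. Going by the description above, the first comparison should report equality and the other two should not.

```sql
-- Make sure string literals are sent and interpreted as utf8.
SET NAMES 'utf8';

-- The COLLATE clause on one operand decides how the comparison is done.
SELECT
  'Krąków' = 'Kraków' COLLATE utf8_general_ci AS eq_general_ci,
  'Krąków' = 'Kraków' COLLATE utf8_bin        AS eq_bin,
  'Krąków' = 'Kraków' COLLATE utf8_polish_ci  AS eq_polish_ci;
```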

As you design your application, you need to sort out how you want all this to work linguistically. Are 'Krąków' and 'Kraków' the same place? Are 'Ąaron' and 'Aaron' the same person? If so, you want utf8_general_ci.

You could consider altering the table you've shown like this:

-- MODIFY COLUMN restates the whole definition, so NOT NULL is repeated here.
ALTER TABLE Cities
  MODIFY COLUMN city_Name
    VARCHAR(145)
    CHARACTER SET utf8
    COLLATE utf8_general_ci
    NOT NULL;

This sets the column's character set and collation explicitly. Use utf8_bin or utf8_polish_ci in place of utf8_general_ci if 'a' and 'ą' must stay distinct.
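Because city_Name is the primary key, switching to a less strict collation can fail with a duplicate-key error if two stored names differ only by case or accent. A sketch of a check to run beforehand, assuming the Cities table from the question; the aliases folded and n are made up for illustration.

```sql
-- List names that would collapse into one key under utf8_general_ci.
SELECT city_Name COLLATE utf8_general_ci AS folded,
       COUNT(*) AS n
FROM Cities
GROUP BY folded
HAVING n > 1;
```

An empty result suggests the ALTER TABLE should not run into key collisions.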
