Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

UTF8 string comparisons in MySQL

We have issues with utf8-string comparisons in MySQL 5, regarding case and accents :

from what I gathered, what MySQL implements collations by considering that "groups of characters should be considered equal".

For example, in the utf8_unicode_ci collation, all the letters "EÉÈÊeéèê" are in the same box (together with other variants of "e").

So if you have a table containing ["video", "vidéo", "vidÉo", "vidÊo", "vidêo", "vidÈo", "vidèo", "vidEo"] (in a varchar column declared with ut8_general_ci collation) :

  • when asking MySQL to sort the rows according to this column, the sorting is random (MySQL does not enforce a sorting rule between "é" and "É" for example),
  • when asking MySQL to add a Unique Key on this column, it raises an error because it considers all the values are equal.

What setting can we fiddle with to fix these two points ?

PS : on a related note, I do not see any case sensitive collation for the utf8 charset. Did I miss something ?


[edit] I think my initial question still holds some interest, and I will leave it as is (and maybe one day get a positive answer).

It turned out, however, that our problems with string comparisons regarding accents was not linked to the collation of our text columns. It was linked to a configuration problem with the character_set_client parameter when talking with MySQL - which defaulted to latin1.

Here is the article that explained it all to us, and allowed us to fix the problem :

Getting out of MySQL character set hell

It is lengthy, but trust me, you need this length to explain both the problem and the fix.

like image 413
LeGEC Avatar asked Aug 11 '11 09:08

LeGEC


1 Answers

Use collation that considers these characters to be distinct. Maybe utf8_bin (it's case sensitive, since it does binary comparison of characters)

http://dev.mysql.com/doc/refman/5.7/en/charset-unicode-sets.html

like image 151
Mchl Avatar answered Oct 11 '22 14:10

Mchl