Please help me to understand the differences between the collations listed in MySQL Workbench:
utf8mb4_unicode_ci
vs utf8mb4 - default collation
p.s. Everyone is recommending using utf8mb4_unicode_ci
. If this is so popular why it is not default? What differs it from the default?
I use MySQL 5.7.21.
utf8mb4_unicode_ci also supports contractions and ignorable characters. utf8mb4_general_ci is a legacy collation that does not support expansions, contractions, or ignorable characters. It can make only one-to-one comparisons between characters.
So, since your collation is utf8mb4_unicode_ci , then it is both case-insensitive and accent-insensitive.
A collation name starts with the name of the character set with which it is associated, generally followed by one or more suffixes indicating other collation characteristics. For example, utf8mb4_0900_ai_ci and latin1_swedish_ci are collations for the utf8mb4 and latin1 character sets, respectively.
A collation is a set of rules that defines how to compare and sort character strings. Each collation in MySQL belongs to a single character set. Every character set has at least one collation, and most have two or more collations. A collation orders characters based on weights.
The default collation (before MySQL 8.0) for utf8mb4 is utf8mb4_general_ci. This checks only one byte at a time, so ss is not considered equal to ß. Most of the other collations for utf8mb4 do consider them equal.
This matches the Unicode Collation Algorithm version 4.0, written several years ago. Then comes utf8mb4_unicode_520_ci (Unicode 5.20), which handles more things "correctly". When you get to MySQL 8.0, there will be a 9.0 version, utf8mb4_0900_ai_ci.
Each character set has a default collation. For example, the default collations for utf8mb4 and latin1 are utf8mb4_0900_ai_ci and latin1_swedish_ci, respectively. The INFORMATION_SCHEMA CHARACTER_SETS table and the SHOW CHARACTER SET statement indicate the default collation for each character set.
You need to do two things. 1) Change your mysql to have utf8mb4 as its character set and 2) Change your database to utf8mb4. The descriptions elsewhere in this section cover making the utf8 database versions using mysqldump. To make mysql default to utf8 you can edit /etc/my.cnf as follows. In the client section of my.cnf
utf8mb4_default
?? Where do you see this?
The default collation (before MySQL 8.0) for utf8mb4 is utf8mb4_general_ci. This checks only one byte at a time, so ss
is not considered equal to ß
. Most of the other collations for utf8mb4 do consider them equal.
Next in the list of "better" collations for general use (as opposed to Spanish-specific, etc) is utf8mb4_unicode_ci
. This matches the Unicode Collation Algorithm version 4.0, written several years ago.
Then comes utf8mb4_unicode_520_ci
(Unicode 5.20), which handles more things "correctly".
When you get to MySQL 8.0, there will be a 9.0 version, utf8mb4_0900_ai_ci
.
For details on the differences, see http://mysql.rjweb.org/utf8_collations.html . (Note: "utf8" versus "utf8mb4" work the same for the information provided on that page.) The first thing to note:
utf8_general_ci A=a=À=Á=Â=Ã=Ä=Å=à=á=â=ã=ä=å=Ā=ā=Ą=ą Aa ae az
utf8_unicode_ci A=a=ª=À=Á=Â=Ã=Ä=Å=à=á=â=ã=ä=å=Ā=ā=Ą=ą Aa ae az Æ=æ
utf8_unicode_520_ci A=a=ª=À=Á=Â=Ã=Ä=Å=à=á=â=ã=ä=å=Ā=ā=Ą=ą Aa ae=Æ=æ az
These 3 lines point out 3 different treatments of Æ
and æ
.
Z
.)ae
.For 5.7, and without any specific language requirements, I would use utf8mb4_unicode_520_ci
.
Back to your question of "why". Changing defaults runs the risk of hurting existing installations more than it helps. So, I guess, the designers were conservative. On the other hand, 8.0 has a lot of major changes, so there was less reluctance to change. Hence, the move to utf8mb4_0900_ai_ci.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With