Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

MySQL collation: utf8mb4_unicode_ci vs "utf8mb4 - default collation"

Please help me to understand the differences between the collations listed in MySQL Workbench:

utf8mb4_unicode_ci vs utf8mb4 - default collation

p.s. Everyone is recommending using utf8mb4_unicode_ci. If this is so popular why it is not default? What differs it from the default?

I use MySQL 5.7.21.

MySQL Workbench

like image 516
Yevgeniy Afanasyev Avatar asked Jul 11 '18 06:07

Yevgeniy Afanasyev


People also ask

What is the difference between utf8mb4_unicode_ci and utf8mb4_general_ci?

utf8mb4_unicode_ci also supports contractions and ignorable characters. utf8mb4_general_ci is a legacy collation that does not support expansions, contractions, or ignorable characters. It can make only one-to-one comparisons between characters.

Is utf8mb4_unicode_ci case sensitive?

So, since your collation is utf8mb4_unicode_ci , then it is both case-insensitive and accent-insensitive.

What is collate utf8mb4_0900_ai_ci?

A collation name starts with the name of the character set with which it is associated, generally followed by one or more suffixes indicating other collation characteristics. For example, utf8mb4_0900_ai_ci and latin1_swedish_ci are collations for the utf8mb4 and latin1 character sets, respectively.

What is MySQL collation?

A collation is a set of rules that defines how to compare and sort character strings. Each collation in MySQL belongs to a single character set. Every character set has at least one collation, and most have two or more collations. A collation orders characters based on weights.

What is the default collation for utf8mb4?

The default collation (before MySQL 8.0) for utf8mb4 is utf8mb4_general_ci. This checks only one byte at a time, so ss is not considered equal to ß. Most of the other collations for utf8mb4 do consider them equal.

What is the Unicode Collation Algorithm version for MySQL?

This matches the Unicode Collation Algorithm version 4.0, written several years ago. Then comes utf8mb4_unicode_520_ci (Unicode 5.20), which handles more things "correctly". When you get to MySQL 8.0, there will be a 9.0 version, utf8mb4_0900_ai_ci.

What is the default collation for each character set?

Each character set has a default collation. For example, the default collations for utf8mb4 and latin1 are utf8mb4_0900_ai_ci and latin1_swedish_ci, respectively. The INFORMATION_SCHEMA CHARACTER_SETS table and the SHOW CHARACTER SET statement indicate the default collation for each character set.

How do I make MySQL default to UTF8?

You need to do two things. 1) Change your mysql to have utf8mb4 as its character set and 2) Change your database to utf8mb4. The descriptions elsewhere in this section cover making the utf8 database versions using mysqldump. To make mysql default to utf8 you can edit /etc/my.cnf as follows. In the client section of my.cnf


1 Answers

utf8mb4_default?? Where do you see this?

The default collation (before MySQL 8.0) for utf8mb4 is utf8mb4_general_ci. This checks only one byte at a time, so ss is not considered equal to ß. Most of the other collations for utf8mb4 do consider them equal.

Next in the list of "better" collations for general use (as opposed to Spanish-specific, etc) is utf8mb4_unicode_ci. This matches the Unicode Collation Algorithm version 4.0, written several years ago.

Then comes utf8mb4_unicode_520_ci (Unicode 5.20), which handles more things "correctly".

When you get to MySQL 8.0, there will be a 9.0 version, utf8mb4_0900_ai_ci.

For details on the differences, see http://mysql.rjweb.org/utf8_collations.html . (Note: "utf8" versus "utf8mb4" work the same for the information provided on that page.) The first thing to note:

utf8_general_ci              A=a=À=Á=Â=Ã=Ä=Å=à=á=â=ã=ä=å=Ā=ā=Ą=ą    Aa  ae          az
utf8_unicode_ci              A=a=ª=À=Á=Â=Ã=Ä=Å=à=á=â=ã=ä=å=Ā=ā=Ą=ą  Aa  ae          az            Æ=æ
utf8_unicode_520_ci          A=a=ª=À=Á=Â=Ã=Ä=Å=à=á=â=ã=ä=å=Ā=ā=Ą=ą  Aa  ae=Æ=æ      az

These 3 lines point out 3 different treatments of Æ and æ.

  • Those two ligatures are treated equal ("case insensitive").
  • general does not sort it anywhere near the other A's. (Far below, we see that they sort after Z.)
  • unicode sorts them after all A's, and just before B, as if they were a separate "letter".
  • _unicode_520_ treats them as equal to letter pair ae.

For 5.7, and without any specific language requirements, I would use utf8mb4_unicode_520_ci.

Back to your question of "why". Changing defaults runs the risk of hurting existing installations more than it helps. So, I guess, the designers were conservative. On the other hand, 8.0 has a lot of major changes, so there was less reluctance to change. Hence, the move to utf8mb4_0900_ai_ci.

like image 172
Rick James Avatar answered Oct 12 '22 18:10

Rick James