What is the difference between utf8mb4_0900_ai_ci and utf8_unicode_ci database text coding in mysql (especially in terms of performance) ?
Update:
There are similar differences between utf8mb4_unicode_ci and utf8mb4_0900_ai_ci?
In short: utf8_unicode_ci uses the Unicode Collation Algorithm as defined in the Unicode standards, whereas utf8_general_ci is a more simple sort order which results in "less accurate" sorting results.
What is the meaning of the MySQL collation utf8mb4_0900_ai_ci? utf8mb4 means that each character is stored as a maximum of 4 bytes in the UTF-8 encoding scheme.
utf8_unicode_ci uses the standard Unicode Collation Algorithm, supports so called expansions and ligatures, for example: German letter ß (U+00DF LETTER SHARP S) is sorted near "ss" Letter Œ (U+0152 LATIN CAPITAL LIGATURE OE) is sorted near "OE".
The difference between utf8 and utf8mb4 is that the former can only store 3 byte characters, while the latter can store 4 byte characters. In Unicode terms, utf8 can only store characters in the Basic Multilingual Plane, while utf8mb4 can store any Unicode character.
utf8_unicode_ci implies the CHARACTER SET utf8, which includes only the 1-, 2-, and 3-byte UTF-8 characters. Hence it excludes most Emoji and some Chinese characters.
utf8mb4_unicode_ci implies the CHARACTER SET utf8mb4 is the corresponding COLLATION for the 4-byte CHARACTER SET utf8mb4.
The Unicode organization has been evolving the specification over the years. Here are the mappings from its "versions" to MySQL Collations:
4.0 _unicode_
5.20 _unicode_520_
9.0 _0900_
Most of the differences will be in areas that most people never encounter. One example: At some point, a change allowed Emoji to be distinguished and ordered in some manner.
The suffix (MySQL doc):
_bin -- just compare the bits; don't consider case folding, accents, etc
_ci -- explicitly case insensitive (A=a) and implicitly accent insensitive (a=á)
_ai_ci -- explicitly case insensitive and accent insensitive
_as (etc) -- accent-sensitive (etc)
Performance:
_bin -- simple, fast
_general_ci -- fails to compare multiple letters; eg ss=ß, so somewhat fast
... -- slower
_900_ -- (8.0) much faster because of a rewrite
However: The speed of collation is usually the least of the performance issues in queries. INDEXes, JOINs, subqueries, table scans, etc are much more critical to performance.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With