What is the difference between utf8mb4_0900_ai_ci
and utf8_unicode_ci
database text coding in mysql (especially in terms of performance) ?
Update:
There are similar differences between utf8mb4_unicode_ci and utf8mb4_0900_ai_ci?
In short: utf8_unicode_ci uses the Unicode Collation Algorithm as defined in the Unicode standards, whereas utf8_general_ci is a more simple sort order which results in "less accurate" sorting results.
What is the meaning of the MySQL collation utf8mb4_0900_ai_ci? utf8mb4 means that each character is stored as a maximum of 4 bytes in the UTF-8 encoding scheme.
utf8_unicode_ci uses the standard Unicode Collation Algorithm, supports so called expansions and ligatures, for example: German letter ß (U+00DF LETTER SHARP S) is sorted near "ss" Letter Œ (U+0152 LATIN CAPITAL LIGATURE OE) is sorted near "OE".
The difference between utf8 and utf8mb4 is that the former can only store 3 byte characters, while the latter can store 4 byte characters. In Unicode terms, utf8 can only store characters in the Basic Multilingual Plane, while utf8mb4 can store any Unicode character.
utf8_unicode_ci
implies the CHARACTER SET utf8
, which includes only the 1-, 2-, and 3-byte UTF-8 characters. Hence it excludes most Emoji and some Chinese characters.
utf8mb4_unicode_ci
implies the CHARACTER SET utf8mb4
is the corresponding COLLATION
for the 4-byte CHARACTER SET utf8mb4
.
The Unicode organization has been evolving the specification over the years. Here are the mappings from its "versions" to MySQL Collations:
4.0 _unicode_
5.20 _unicode_520_
9.0 _0900_
Most of the differences will be in areas that most people never encounter. One example: At some point, a change allowed Emoji to be distinguished and ordered in some manner.
The suffix (MySQL doc):
_bin -- just compare the bits; don't consider case folding, accents, etc
_ci -- explicitly case insensitive (A=a) and implicitly accent insensitive (a=á)
_ai_ci -- explicitly case insensitive and accent insensitive
_as (etc) -- accent-sensitive (etc)
Performance:
_bin -- simple, fast
_general_ci -- fails to compare multiple letters; eg ss=ß, so somewhat fast
... -- slower
_900_ -- (8.0) much faster because of a rewrite
However: The speed of collation is usually the least of the performance issues in queries. INDEXes
, JOINs
, subqueries, table scans, etc are much more critical to performance.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With