Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What's the difference between utf8_unicode_ci and utf8mb4_0900_ai_ci

Tags:

mysql

unicode

What is the difference between utf8mb4_0900_ai_ci and utf8_unicode_ci database text coding in mysql (especially in terms of performance) ?

Update:

There are similar differences between utf8mb4_unicode_ci and utf8mb4_0900_ai_ci?

like image 971
Kamil Kiełczewski Avatar asked Feb 26 '19 12:02

Kamil Kiełczewski


People also ask

What is the difference between utf8_general_ci and utf8_unicode_ci?

In short: utf8_unicode_ci uses the Unicode Collation Algorithm as defined in the Unicode standards, whereas utf8_general_ci is a more simple sort order which results in "less accurate" sorting results.

What is utf8mb4_0900_ai_ci?

What is the meaning of the MySQL collation utf8mb4_0900_ai_ci? utf8mb4 means that each character is stored as a maximum of 4 bytes in the UTF-8 encoding scheme.

What is utf8_unicode_ci?

utf8_unicode_ci uses the standard Unicode Collation Algorithm, supports so called expansions and ligatures, for example: German letter ß (U+00DF LETTER SHARP S) is sorted near "ss" Letter Œ (U+0152 LATIN CAPITAL LIGATURE OE) is sorted near "OE".

Should I use utf8 or utf8mb4?

The difference between utf8 and utf8mb4 is that the former can only store 3 byte characters, while the latter can store 4 byte characters. In Unicode terms, utf8 can only store characters in the Basic Multilingual Plane, while utf8mb4 can store any Unicode character.


1 Answers

  • The encoding is the same. That is, the bytes look the same.
  • The character set is different. utf8mb4 has more characters.
  • The collation (how comparisions are done) is different.
  • The perfomance is different, but it rarely matters.

utf8_unicode_ci implies the CHARACTER SET utf8, which includes only the 1-, 2-, and 3-byte UTF-8 characters. Hence it excludes most Emoji and some Chinese characters.

utf8mb4_unicode_ci implies the CHARACTER SET utf8mb4 is the corresponding COLLATION for the 4-byte CHARACTER SET utf8mb4.

The Unicode organization has been evolving the specification over the years. Here are the mappings from its "versions" to MySQL Collations:

4.0   _unicode_
5.20  _unicode_520_
9.0   _0900_

Most of the differences will be in areas that most people never encounter. One example: At some point, a change allowed Emoji to be distinguished and ordered in some manner.

The suffix (MySQL doc):

_bin      -- just compare the bits; don't consider case folding, accents, etc
_ci       -- explicitly case insensitive (A=a) and implicitly accent insensitive (a=á)
_ai_ci    -- explicitly case insensitive and accent insensitive
_as (etc) -- accent-sensitive (etc)

Performance:

_bin         -- simple, fast
_general_ci  -- fails to compare multiple letters; eg ss=ß, so somewhat fast
...          -- slower
_900_        -- (8.0) much faster because of a rewrite

However: The speed of collation is usually the least of the performance issues in queries. INDEXes, JOINs, subqueries, table scans, etc are much more critical to performance.

like image 143
Rick James Avatar answered Oct 09 '22 21:10

Rick James