Possible Duplicate:
What's the difference between utf8_general_ci and utf8_unicode_ci
I've got two options for unicode that look promising for a mysql database.
utf8_general_ci unicode (multilingual), case-insensitive utf8_unicode_ci unicode (multilingual), case-insensitive
Can you please explain what is the difference between utf8_general_ci and utf8_unicode_ci? What are the effects of choosing one over the other when designing a database?
In short: utf8_unicode_ci uses the Unicode Collation Algorithm as defined in the Unicode standards, whereas utf8_general_ci is a more simple sort order which results in "less accurate" sorting results. If you don't care about correctness, then it's trivial to make any algorithm infinitely fast.
utf8mb3 supports only characters in the Basic Multilingual Plane (BMP). utf8mb4 additionally supports supplementary characters that lie outside the BMP. utf8mb3 uses a maximum of three bytes per character. utf8mb4 uses a maximum of four bytes per character.
Everywhere in mariadb/ mysql utf8mb4 should be used. Plain utf8 is a legacy encoding from the time when that extra byte was considered a waste of space. But full utf8 support is always always what engineers want nowadays.
utf8mb4 : A UTF-8 encoding of the Unicode character set using one to four bytes per character. utf8mb3 : A UTF-8 encoding of the Unicode character set using one to three bytes per character. This character set is deprecated in MySQL 8.0, and you should use utfmb4 instead.
utf8_general_ci
is a very simple — and on Unicode, very broken — collation, one that gives incorrect results on general Unicode text. What it does is:
This does not work correctly on Unicode, because it does not understand Unicode casing. Unicode casing alone is much more complicated than an ASCII-minded approach can handle. For example:
There are many other subtleties.
utf8_unicode_ci
uses the standard Unicode Collation Algorithm, supports so called expansions and ligatures, for example: German letter ß (U+00DF LETTER SHARP S) is sorted near "ss" Letter Œ (U+0152 LATIN CAPITAL LIGATURE OE) is sorted near "OE". utf8_general_ci
does not support expansions/ligatures, it sorts all these letters as single characters, and sometimes in a wrong order.
utf8_unicode_ci
is generally more accurate for all scripts. For example, on Cyrillic block: utf8_unicode_ci
is fine for all these languages: Russian, Bulgarian, Belarusian, Macedonian, Serbian, and Ukrainian. While utf8_general_ci is fine only for Russian and Bulgarian subset of Cyrillic. Extra letters used in Belarusian, Macedonian, Serbian, and Ukrainian are sorted not well. The cost of utf8_unicode_ci
is that it is a little bit slower than utf8_general_ci
. But that’s the price you pay for correctness. Either you can have a fast answer that’s wrong, or a very slightly slower answer that’s right. Your choice. It is very difficult to ever justify giving wrong answers, so it’s best to assume that utf8_general_ci
doesn’t exist and to always use utf8_unicode_ci
. Well, unless you want wrong answers.
Source: http://forums.mysql.com/read.php?103,187048,188748#msg-188748
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With