I recently upgraded WAMPServer and now in phpMyAdmin, I see both collations utf8_unicode_ci
and utf8_unicode_520_ci
. I was using "utf8_unicode_ci" for Turkish applications, but I wonder what is different in the new *_520
collation?
In short: utf8_unicode_ci uses the Unicode Collation Algorithm as defined in the Unicode standards, whereas utf8_general_ci is a more simple sort order which results in "less accurate" sorting results. If you don't care about correctness, then it's trivial to make any algorithm infinitely fast.
utf8_unicode_ci uses the standard Unicode Collation Algorithm, supports so called expansions and ligatures, for example: German letter ß (U+00DF LETTER SHARP S) is sorted near "ss" Letter Œ (U+0152 LATIN CAPITAL LIGATURE OE) is sorted near "OE".
utf8_unicode_ci also supports contractions and ignorable characters. utf8_general_ci is a legacy collation that does not support expansions, contractions, or ignorable characters. It can make only one-to-one comparisons between characters.
The utf8_bin collation compares strings based purely on their Unicode code point values. If all of the code points have the same values, then the strings are equal. However, this falls apart when you have strings with different composition for combining marks (composed vs.
As documented under Collation Names:
Unicode collation names may include a version number to indicate the version of the Unicode Collation Algorithm (UCA) on which the collation is based. UCA-based collations without a version number in the name use the version-4.0.0 UCA weight keys: http://www.unicode.org/Public/UCA/4.0.0/allkeys-4.0.0.txt. A collation name such as
utf8_unicode_520_ci
is based on UCA 5.2.0 weight keys: http://www.unicode.org/Public/UCA/5.2.0/allkeys.txt.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With