Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What are the differences between utf8_general_ci and utf8_unicode_ci? [duplicate]

Possible Duplicate:
What's the difference between utf8_general_ci and utf8_unicode_ci

I've got two options for unicode that look promising for a mysql database.

utf8_general_ci unicode (multilingual), case-insensitive utf8_unicode_ci unicode (multilingual), case-insensitive 

Can you please explain what is the difference between utf8_general_ci and utf8_unicode_ci? What are the effects of choosing one over the other when designing a database?

like image 206
reconbot Avatar asked Jun 24 '09 04:06

reconbot


People also ask

What is the difference between utf8_general_ci and utf8_unicode_ci?

In short: utf8_unicode_ci uses the Unicode Collation Algorithm as defined in the Unicode standards, whereas utf8_general_ci is a more simple sort order which results in "less accurate" sorting results. If you don't care about correctness, then it's trivial to make any algorithm infinitely fast.

What is difference between utf8mb3 and utf8mb4?

utf8mb3 supports only characters in the Basic Multilingual Plane (BMP). utf8mb4 additionally supports supplementary characters that lie outside the BMP. utf8mb3 uses a maximum of three bytes per character. utf8mb4 uses a maximum of four bytes per character.

Should I use utf8mb4 or utf8?

Everywhere in mariadb/ mysql utf8mb4 should be used. Plain utf8 is a legacy encoding from the time when that extra byte was considered a waste of space. But full utf8 support is always always what engineers want nowadays.

What does utf8mb4 mean?

utf8mb4 : A UTF-8 encoding of the Unicode character set using one to four bytes per character. utf8mb3 : A UTF-8 encoding of the Unicode character set using one to three bytes per character. This character set is deprecated in MySQL 8.0, and you should use utfmb4 instead.


1 Answers

utf8_general_ci is a very simple — and on Unicode, very broken — collation, one that gives incorrect results on general Unicode text. What it does is:

  • converts to Unicode normalization form D for canonical decomposition
  • removes any combining characters
  • converts to upper case

This does not work correctly on Unicode, because it does not understand Unicode casing. Unicode casing alone is much more complicated than an ASCII-minded approach can handle. For example:

  • The lowercase of “ẞ” is “ß”, but the uppercase of “ß” is “SS”.
  • There are two lowercase Greek sigmas, but only one uppercase one; consider “Σίσυφος”.
  • Letters like “ø” do not decompose to an “o” plus a diacritic, meaning that it won’t correctly sort.

There are many other subtleties.

  1. utf8_unicode_ci uses the standard Unicode Collation Algorithm, supports so called expansions and ligatures, for example: German letter ß (U+00DF LETTER SHARP S) is sorted near "ss" Letter Œ (U+0152 LATIN CAPITAL LIGATURE OE) is sorted near "OE".

utf8_general_ci does not support expansions/ligatures, it sorts all these letters as single characters, and sometimes in a wrong order.

  1. utf8_unicode_ci is generally more accurate for all scripts. For example, on Cyrillic block: utf8_unicode_ci is fine for all these languages: Russian, Bulgarian, Belarusian, Macedonian, Serbian, and Ukrainian. While utf8_general_ci is fine only for Russian and Bulgarian subset of Cyrillic. Extra letters used in Belarusian, Macedonian, Serbian, and Ukrainian are sorted not well.

The cost of utf8_unicode_ci is that it is a little bit slower than utf8_general_ci. But that’s the price you pay for correctness. Either you can have a fast answer that’s wrong, or a very slightly slower answer that’s right. Your choice. It is very difficult to ever justify giving wrong answers, so it’s best to assume that utf8_general_ci doesn’t exist and to always use utf8_unicode_ci. Well, unless you want wrong answers.

Source: http://forums.mysql.com/read.php?103,187048,188748#msg-188748

like image 50
5 revs, 4 users 68% Avatar answered Oct 02 '22 05:10

5 revs, 4 users 68%