What are the differences between utf8_general_ci and utf8_unicode_ci? [duplicate]

Possible Duplicate:
What's the difference between utf8_general_ci and utf8_unicode_ci

I've got two options for Unicode that look promising for a MySQL database.

utf8_general_ci: unicode (multilingual), case-insensitive
utf8_unicode_ci: unicode (multilingual), case-insensitive

Can you please explain the difference between utf8_general_ci and utf8_unicode_ci? What are the effects of choosing one over the other when designing a database?

206

asked Jun 24 '09 04:06

reconbot

1 Answer

utf8_general_ci is a very simple collation, and on general Unicode text a very broken one: it gives incorrect results. In effect, what it does is:

converts to Unicode normalization form D for canonical decomposition
removes any combining characters
converts to upper case
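
The three steps above can be approximated in Python with the standard unicodedata module. This is only a sketch of the behaviour as described here, not MySQL's actual implementation; the helper name general_ci_like_key and the one-code-point-at-a-time uppercasing are assumptions made for illustration.

```python
import unicodedata


def general_ci_like_key(text: str) -> str:
    """Approximate the three described steps (hypothetical helper, not MySQL code)."""
    # Step 1: canonical decomposition (NFD), e.g. "é" becomes "e" + U+0301.
    decomposed = unicodedata.normalize("NFD", text)
    # Step 2: drop characters with a non-zero canonical combining class.
    stripped = "".join(ch for ch in decomposed if unicodedata.combining(ch) == 0)
    # Step 3: naive uppercasing, one code point in and one code point out.
    # Characters whose uppercase form is longer (such as "ß") are left as-is.
    return "".join(ch.upper() if len(ch.upper()) == 1 else ch for ch in stripped)


print(general_ci_like_key("résumé") == general_ci_like_key("RESUME"))  # accents fold away
print(general_ci_like_key("søk") == general_ci_like_key("sok"))        # "ø" does not fold to "o"
```

Accented letters that decompose compare equal to their base letters under this key, while a letter such as "ø", which has no decomposition, passes through all three steps untouched.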

This does not work correctly on Unicode, because it does not understand Unicode casing. Unicode casing alone is much more complicated than an ASCII-minded approach can handle. For example:

The lowercase of “ẞ” is “ß”, but the uppercase of “ß” is “SS”.
There are two lowercase Greek sigmas, but only one uppercase one; consider “Σίσυφος”.
Letters like “ø” do not decompose to an “o” plus a diacritic, so removing combining characters leaves them unchanged and they won’t sort correctly.
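
The cases in this list can be checked with Python's built-in string methods and the unicodedata module, both of which follow the Unicode character database. This is just a quick check of the casing and decomposition facts, not a model of either collation; the variable names are only illustrative.

```python
import unicodedata

# Capital sharp s lowercases to "ß", but "ß" uppercases to two letters.
lower_of_capital_sharp_s = "ẞ".lower()
upper_of_sharp_s = "ß".upper()

# Both lowercase sigmas (medial "σ" and final "ς") share one uppercase form.
sigmas_share_uppercase = "σ".upper() == "ς".upper() == "Σ"

# Round-tripping through uppercase: lowercasing has to choose between the
# two lowercase sigmas based on position in the word.
round_trip = "Σίσυφος".upper().lower()

# "ø" has no canonical decomposition, so NFD leaves it unchanged.
o_slash_decomposition = unicodedata.decomposition("ø")
o_slash_nfd = unicodedata.normalize("NFD", "ø")

print(lower_of_capital_sharp_s, upper_of_sharp_s)
print(sigmas_share_uppercase)
print(round_trip)
print(repr(o_slash_decomposition), o_slash_nfd == "ø")
```

Case mappings are neither one-to-one nor always context-free, which is why a per-character uppercase table cannot get them right.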

There are many other subtleties.

utf8_unicode_ci uses the standard Unicode Collation Algorithm and supports so-called expansions and ligatures. For example:

The German letter ß (U+00DF LATIN SMALL LETTER SHARP S) is sorted near "ss".
The letter Œ (U+0152 LATIN CAPITAL LIGATURE OE) is sorted near "OE".

utf8_general_ci does not support expansions or ligatures. It sorts all these letters as single characters, and sometimes in the wrong order.
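
To make the idea of an expansion concrete, here is a toy comparison in Python. The two tiny tables are invented for illustration and are nowhere near real collation weight tables; the only mappings borrowed from MySQL are the documented equalities ß = s for utf8_general_ci and ß = ss for utf8_unicode_ci. The function names are hypothetical.

```python
# Toy tables, invented for illustration only.
ONE_TO_ONE = {"ß": "s"}               # one character in, one weight out
EXPANSIONS = {"ß": "ss", "œ": "oe"}   # one character may expand to several


def one_to_one_key(text: str) -> str:
    """Sort key where every character maps to exactly one character."""
    return "".join(ONE_TO_ONE.get(ch, ch) for ch in text.lower())


def expansion_key(text: str) -> str:
    """Sort key where a character may expand, e.g. "ß" becomes "ss"."""
    return "".join(EXPANSIONS.get(ch, ch) for ch in text.lower())


# With expansions, "Maße" compares equal to "Masse".
print(expansion_key("Maße") == expansion_key("Masse"))
# With one-to-one weights, "Maße" compares equal to "Mase" instead.
print(one_to_one_key("Maße") == one_to_one_key("Mase"))
# With expansions, "Œuvre" sorts between "Oboe" and "Ofen", near "OE".
print(sorted(["Ofen", "Œuvre", "Oboe"], key=expansion_key))
```

A one-to-one scheme has no way to say that a single character should weigh the same as two, so the best it can do is pick one neighbouring letter.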

utf8_unicode_ci is generally more accurate for all scripts. For example, in the Cyrillic block, utf8_unicode_ci is fine for all of these languages: Russian, Bulgarian, Belarusian, Macedonian, Serbian, and Ukrainian, while utf8_general_ci is fine only for the Russian and Bulgarian subset of Cyrillic. The extra letters used in Belarusian, Macedonian, Serbian, and Ukrainian are not sorted well.

The cost of utf8_unicode_ci is that it is a little bit slower than utf8_general_ci. But that’s the price you pay for correctness. Either you can have a fast answer that’s wrong, or a very slightly slower answer that’s right. Your choice. It is very difficult to ever justify giving wrong answers, so it’s best to assume that utf8_general_ci doesn’t exist and to always use utf8_unicode_ci. Well, unless you want wrong answers.

Source: http://forums.mysql.com/read.php?103,187048,188748#msg-188748

answered Oct 02 '22 05:10

5 revs, 4 users 68%

Tags: mysql, character-encoding, unicode
