
Which of utf8 collations is the best? [closed]

I want a UTF8 collation for supporting:

  • English
  • Persian
  • Arabic
  • French
  • Japanese
  • Chinese

Does UTF8_GENERAL_CI support all of these languages?

armin etemadi asked Apr 24 '10 07:04


People also ask

Which UTF-8 collation should I use?

If you want UTF-8 in MySQL, always use the utf8mb4 character set (for example with the utf8mb4_unicode_ci collation). You should not use MySQL's utf8, because it differs from proper UTF-8 encoding: it does not offer full Unicode support, which can lead to data loss or security issues.

Should I use UTF-8 or utf8mb4?

The difference between utf8 and utf8mb4 is that the former can only store characters that take up to 3 bytes, while the latter can store characters that take up to 4 bytes. In Unicode terms, utf8 can only store characters in the Basic Multilingual Plane, while utf8mb4 can store any Unicode character.

What is the difference between utf8_general_ci and utf8_unicode_ci?

In short: utf8_unicode_ci uses the Unicode Collation Algorithm as defined in the Unicode standards, whereas utf8_general_ci uses a simpler sort order that gives "less accurate" sorting results.


2 Answers

Yes, that is correct. UTF-8 is an encoding for the Unicode character set, which supports pretty much every language in the world.

I think the only difference is in how your results are sorted: letters such as accented characters and umlauts may come in a different order in different languages. Comparing a to ä might also behave differently under another collation.

The _ci suffix means that sorting and comparison are case-insensitive.
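To make the point about case and accents concrete, here is a minimal Python sketch. It is not MySQL's collation algorithm; `loose_key` is a hypothetical helper that only imitates the idea of a case- and accent-insensitive comparison by case-folding and dropping combining marks.

```python
import unicodedata


def loose_key(text: str) -> str:
    """Rough stand-in for a case- and accent-insensitive collation key.

    Assumption: this only illustrates the idea; real MySQL collations
    use their own weight tables and will differ in details.
    """
    # Case-fold, then decompose so accents become separate combining marks.
    decomposed = unicodedata.normalize("NFD", text.casefold())
    # Drop the combining marks, keeping only the base letters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


words = ["zebra", "Äpfel", "apple"]

# Plain code point comparison: "Ä" is a different character from "a".
print("Ä" == "a")                       # False
print(loose_key("Ä") == loose_key("a")) # True

# Code point order pushes "Äpfel" to the end; the loose key sorts it under "a".
print(sorted(words))
print(sorted(words, key=loose_key))
```

Under a binary comparison, a and ä are simply different characters, while a collation of this kind treats them as equal and sorts them together. Which behaviour you get in MySQL depends on the collation you pick.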

http://www.collation-charts.org/ might be of interest to you.

knittl answered Oct 06 '22 00:10


UTF8_GENERAL_CI was a good choice some time ago, but it has some drawbacks now.

MySQL's UTF8 actually uses at most 3 bytes per character instead of 4, and you need the fourth byte for symbols like emoji and newer Asian characters.

So MySQL has a newer charset called utf8mb4, which actually complies with the UTF-8 definition.

To fully support Asian languages, you will need to choose utf8mb4.
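A quick way to see the 3-byte limit is to check how many bytes a character takes in standard UTF-8. The sketch below is plain Python, not a MySQL API; `utf8_len` and `fits_mysql_utf8` are hypothetical helper names, and the check assumes that "fits in MySQL's utf8" means every code point lies in the Basic Multilingual Plane.

```python
def utf8_len(ch: str) -> int:
    """Number of bytes the character occupies in standard UTF-8."""
    return len(ch.encode("utf-8"))


def fits_mysql_utf8(text: str) -> bool:
    """True if every code point is in the Basic Multilingual Plane
    (U+0000..U+FFFF), i.e. needs at most 3 bytes in UTF-8.

    Assumption: this mirrors the 3-byte limit of MySQL's utf8 charset.
    """
    return all(ord(ch) <= 0xFFFF for ch in text)


# Latin, accented Latin, Arabic script, a common CJK character,
# an emoji, and a CJK character from outside the BMP.
samples = ["a", "é", "س", "日", "\U0001F600", "\U0002000B"]

for ch in samples:
    print(f"U+{ord(ch):04X}: {utf8_len(ch)} byte(s), "
          f"fits 3-byte utf8: {fits_mysql_utf8(ch)}")
```

Everyday English, Persian, Arabic, French, Japanese and Chinese text mostly stays within 3 bytes per character; the last two samples are the kind of characters that need utf8mb4.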

If you care about correct sorting in multiple languages, use utf8mb4_unicode_ci instead of utf8mb4_general_ci.

You can find a more detailed answer in What's the difference between utf8_general_ci and utf8_unicode_ci

Aistis answered Oct 06 '22 01:10
