Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

utf8 collation difference between unicode and danish

Well hello there. I'm changing the charset of a database from latin1_sweedish_ci to utf8. I've allways used utf8_danish_ci because it's the closest to the Norwegian character style - I think.
But what about utf8_general_ci and utf8_unicode_ci?

Some time ago; It was preferred to use _general_ci for better/faster performance, and _unicode_ci for more accuracy because of the sorting algorithm is more complex in the latter. But since the speed/performace is no longer an issue - or not so much an issue i most cases anymore - _unicode_ci is ok to use in most situations?

But how does _unicode_ci differ from _danish_ci?
Is it the last three letters æ, ø, å in the nordic alphabet that is taken into account?

Most comparisson (one vs the other) I can find is only between _general_ci and _unicode_ci.

Anyone know of any examples for when to use _unicode_ci or when to use _danish_ci would be highly appreciated...

like image 404
ThomasK Avatar asked Mar 24 '23 23:03

ThomasK


2 Answers

In short, if your app is multilingual and storing multiple languages within the same tables, you're mostly screwed and should worry about sorting/collating outside of the database -- utf8_general_ci is then as good as any other.

If it only supports a single language, you'll do fine by setting the correct collation at the db level -- in your case, utf8_danish_ci indeed, since it's the same as norwegian if wikipedia is anything to go by.

In case you'd like to read more on collation, the ICU docs has colorful examples of how thorny this kind of stuff gets. Quoting extensively:

http://userguide.icu-project.org/collation

[H]ere are some of the ways languages vary in ordering strings:

The letters A-Z can be sorted in a different order than in English. For example, in Lithuanian, "y" is sorted between "i" and "k".

Combinations of letters can be treated as if they were one letter. For example, in traditional Spanish "ch" is treated as a single letter, and sorted between "c" and "d".

Accented letters can be treated as minor variants of the unaccented letter. For example, "é" can be treated equivalent to "e".

Accented letters can be treated as distinct letters. For example, "Å" in Danish is treated as a separate letter that sorts just after "Z".

Unaccented letters that are considered distinct in one language can be indistinct in another. For example, the letters "v" and "w" are two different letters according to English. However, "v" and "w" are considered variant forms of the same letter in Swedish.

A letter can be treated as if it were two letters. For example, in traditional German "ä" is compared as if it were "ae".

Thai requires that the order of certain letters be reversed.

French requires that letters sorted with accents at the end of the string be sorted ahead of accents in the beginning of the string. For example, the word "côte" sorts before "coté" because the acute accent on the final "e" is more significant than the circumflex on the "o".

Sometimes lowercase letters sort before uppercase letters. The reverse is required in other situations. For example, lowercase letters are usually sorted before uppercase letters in English. Latvian letters are the exact opposite.

Even in the same language, different applications might require different sorting orders. For example, in German dictionaries, "öf" would come before "of". In phone books the situation is the exact opposite.

Sorting orders can change over time due to government regulations or new characters/scripts in Unicode.

like image 120
Denis de Bernardy Avatar answered Apr 03 '23 17:04

Denis de Bernardy


Please bear in mind that Collation != Encoding.

Encoding is the mapping between integers (which all that a database can store at the end of the day) and human readable graphic representations of characters.

Collation is the ordering rule used to sort characters according to the conventional alphabetical order of a given language. Notice that this ordering does not reflect the actual order of the internal, numeric representation.

Your question reduces to the following: what kind of alphabetical order should you use in your application? This cannot be answered.

like image 30
RandomSeed Avatar answered Apr 03 '23 16:04

RandomSeed