
What is the best MySQL collation for the German language?

I am building a website in German, so I will be using characters like ä, ü, ß, etc. What are your recommendations?

TooCooL asked Apr 02 '11 21:04



People also ask

What is the best collation for MySQL?

If you're using MySQL 8.0, the default charset is utf8mb4. If you elect to use UTF-8 as your character set, always use utf8mb4 (for example with the collation utf8mb4_unicode_ci).

Is German text UTF-8?

As for what encoding to use, Germans often use ISO/IEC 8859-15, but UTF-8 is increasingly becoming the norm, and can handle any kind of non-ASCII characters at the same time. UTF-8 is actually quite common in Germany now and can make all the difference when using German text.

What is UTF-8 collation in MySQL?

MySQL supports multiple Unicode character sets. utf8mb4: a UTF-8 encoding of the Unicode character set using one to four bytes per character. utf8mb3: a UTF-8 encoding of the Unicode character set using one to three bytes per character. This character set is deprecated in MySQL 8.0, and you should use utf8mb4 instead.

What is MySQL collation types?

In MySQL, collations that map several character codes to the same weight are case-insensitive and accent-insensitive. utf8mb4_general_ci is an example: 'a', 'A', 'À', and 'á' each have different character codes but all have a weight of 0x0041 and compare as equal.


3 Answers

This answer is outdated. For full emoji support, see this answer.

As the character set, if you can, definitely UTF-8.

As for the collation, that's a bit nasty for languages with special characters. There are various types of collations. All of them can store umlauts and other special characters, but they differ in how they treat umlauts in comparisons, i.e. whether

u = ü

is true or false, and in sorting (where the umlauts fall in the alphabetical order).

To make a long story short, your best bet is either

utf8_unicode_ci

It allows case-insensitive searches, treats ß as ss, and uses DIN-1 sorting. Sadly, like most non-binary Unicode collations, it treats u = ü, which is a terrible nuisance because a search for "Muller" will also return "Müller". You will have to work around that by setting an umlaut-aware collation at query time.

or utf8_bin

This collation does not have the u = ü problem, but only case-sensitive searches are possible.
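To make the trade-off between the two options concrete, here is a rough Python sketch. It is not MySQL's actual collation algorithm (MySQL uses weight tables), only an approximation of the visible behaviour: a case- and accent-insensitive comparison next to a byte-for-byte one. The function names are made up for this illustration.

```python
import unicodedata


def fold_key(text: str) -> str:
    """Approximate a case- and accent-insensitive comparison key.

    casefold() lowercases and expands ß to ss; NFD splits ü into u plus a
    combining diaeresis, which is then dropped.
    """
    decomposed = unicodedata.normalize("NFD", text.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def insensitive_equal(a: str, b: str) -> bool:
    """Roughly how a *_ci collation behaves: u = ü, ß = ss, A = a."""
    return fold_key(a) == fold_key(b)


def binary_equal(a: str, b: str) -> bool:
    """Roughly how a *_bin collation behaves: compare the encoded bytes."""
    return a.encode("utf-8") == b.encode("utf-8")
```

With this sketch, insensitive_equal("Muller", "Müller") is true while binary_equal("Muller", "Müller") is false, but binary_equal("müller", "Müller") is false as well, which is exactly the case-sensitivity drawback mentioned above.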

I'm not entirely sure whether there are any other side effects to using the binary collation; I asked a question about that here.


This MySQL manual page gives a good overview of the various collations and the consequences they bring in everyday use.

Here is a general overview of the available collations in MySQL.

Pekka answered Sep 20 '22 21:09



To support the complete UTF-8 standard you have to use the charset utf8mb4 and the collation utf8mb4_unicode_ci in MySQL!

Note: MySQL supports only 1- to 3-byte characters in its so-called utf8 charset! This is why modern emoji are not supported: they need 4 bytes.
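The byte counts are easy to check outside MySQL. The following Python sketch only measures how many bytes each character needs in UTF-8; the helper names are invented for this example, and the 3-byte cutoff stands in for MySQL's utf8 charset as described in the note above.

```python
def utf8_byte_length(char: str) -> int:
    """Number of bytes one character occupies when encoded as UTF-8."""
    return len(char.encode("utf-8"))


def fits_in_3_byte_utf8(text: str) -> bool:
    """True if every character needs at most 3 bytes, i.e. would fit
    into a charset limited to 1- to 3-byte UTF-8 sequences."""
    return all(utf8_byte_length(ch) <= 3 for ch in text)


# German special characters need 2 bytes, the euro sign 3, an emoji 4.
SAMPLE_LENGTHS = {
    ch: utf8_byte_length(ch)
    for ch in ["a", "ä", "ß", "€", "\U0001F600"]
}
```

So plain German text such as "Müller, Straße" passes fits_in_3_byte_utf8, while any string containing an emoji like U+1F600 does not.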

The only way to fully support the UTF-8 standard is to change the charset and collation of ALL tables and of the database itself to utf8mb4 and utf8mb4_unicode_ci. Furthermore, the database connection needs to use utf8mb4 as well.

The MySQL server must use utf8mb4 as its default charset, which can be configured manually in /etc/mysql/conf.d/mysql.cnf:

    [client]
    default-character-set = utf8mb4

    [mysql]
    default-character-set = utf8mb4

    [mysqld]
    # character-set-client-handshake = FALSE  ## better not set this!
    character-set-server = utf8mb4
    collation-server = utf8mb4_unicode_ci
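If you want to double-check such a file, a small Python sketch along these lines might help. It assumes the file contains only simple [section] / key = value lines like the ones above (real my.cnf files with directives such as !includedir may not parse this way), and the function and constant names are hypothetical.

```python
import configparser

# The settings from the configuration above, as (section, option) -> value.
EXPECTED = {
    ("client", "default-character-set"): "utf8mb4",
    ("mysql", "default-character-set"): "utf8mb4",
    ("mysqld", "character-set-server"): "utf8mb4",
    ("mysqld", "collation-server"): "utf8mb4_unicode_ci",
}

SAMPLE_CNF = """\
[client]
default-character-set = utf8mb4

[mysql]
default-character-set = utf8mb4

[mysqld]
# character-set-client-handshake = FALSE
character-set-server = utf8mb4
collation-server = utf8mb4_unicode_ci
"""


def find_mismatches(cnf_text: str) -> list:
    """Return (section, option, actual_value) for every expected setting
    that is missing or set to something else."""
    parser = configparser.ConfigParser(allow_no_value=True, interpolation=None)
    parser.read_string(cnf_text)
    mismatches = []
    for (section, option), wanted in EXPECTED.items():
        actual = parser.get(section, option, fallback=None)
        if actual != wanted:
            mismatches.append((section, option, actual))
    return mismatches
```

find_mismatches(SAMPLE_CNF) returns an empty list; a file that still says character-set-server = utf8 would be reported.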

Existing tables can be migrated to utf8mb4 using the following SQL statement:

    ALTER TABLE <table-name> CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
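With many tables, typing that statement by hand gets tedious. Here is a minimal Python sketch that only builds the statements as strings for a list of table names; it does not connect to any database, the table names in the usage note are placeholders, and the helper names are my own.

```python
def quote_identifier(name: str) -> str:
    """Quote a MySQL identifier with backticks, doubling embedded backticks."""
    return "`" + name.replace("`", "``") + "`"


def build_convert_statements(
    tables,
    charset: str = "utf8mb4",
    collation: str = "utf8mb4_unicode_ci",
) -> list:
    """Build one ALTER TABLE ... CONVERT TO statement per table name."""
    return [
        f"ALTER TABLE {quote_identifier(table)} "
        f"CONVERT TO CHARACTER SET {charset} COLLATE {collation};"
        for table in tables
    ]
```

For example, build_convert_statements(["users", "orders"]) yields two statements that you can review and then run in the mysql client. Take a backup first, since the conversion rewrites the table.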

Note:

  • To make sure JOINs between table columns are not slowed down by charset conversions, ALL tables have to be changed!
  • MySQL limits the length of an index, so the number of characters in an indexed column, multiplied by 4 bytes per character, must not exceed 3072 bytes.

When the innodb_large_prefix configuration option is enabled, this length limit is raised to 3072 bytes, for InnoDB tables that use the DYNAMIC and COMPRESSED row formats.
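The arithmetic behind that limit can be sketched in a few lines of Python. The 3072-byte figure comes from the text above; the 767-byte figure is the InnoDB limit without the large prefix as I remember it from the MySQL manual, so treat it as an assumption and check it against your version. The function names are made up.

```python
BYTES_PER_CHAR_UTF8MB4 = 4    # worst case per character in utf8mb4
LARGE_PREFIX_LIMIT = 3072     # bytes, from the text above
DEFAULT_PREFIX_LIMIT = 767    # bytes, assumption: limit without large prefix


def max_indexable_chars(limit_bytes: int,
                        bytes_per_char: int = BYTES_PER_CHAR_UTF8MB4) -> int:
    """Longest column (in characters) that can be indexed in full."""
    return limit_bytes // bytes_per_char


def fits_in_index(column_chars: int, limit_bytes: int,
                  bytes_per_char: int = BYTES_PER_CHAR_UTF8MB4) -> bool:
    """True if a column of that many characters stays within the limit."""
    return column_chars * bytes_per_char <= limit_bytes
```

That gives 768 characters under the 3072-byte limit and 191 under the 767-byte one, which is why a fully indexed VARCHAR(255) column in utf8mb4 (1020 bytes) only works with the larger limit.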

To change the charset and default collation of the database, run this command:

ALTER DATABASE CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci; 

Since utf8mb4 is fully backwards compatible with utf8, no mojibake or other forms of data loss should occur.

Roland answered Sep 23 '22 21:09



utf8_general_ci or utf8_unicode_ci.

To learn the difference, see: UTF-8: General? Bin? Unicode?

Sandro Munda answered Sep 22 '22 21:09
