In a table x
, there is a column with the values u
and ü
.
SELECT * FROM x WHERE column='u'
.
This returns u
AND ü
, although I am only looking for the u
.
The table's collation is utf8mb4_unicode_ci
. Wherever I read about similar problems, everyone suggests to use this collation because they say that utf8mb4
really covers ALL CHARACTERS. With this collation, all character set and collation problems should be solved.
I can insert ü
, è
, é
, à
, Chinese characters
, etc. When I make a SELECT *
, they are also retrieved and displayed correctly.
The problem only occurs when I COMPARE two strings as in above example (SELECT WHERE
) or when I use a UNIQUE INDEX
on the column. When I use the UNIQUE INDEX
, a "ü"
is not inserted when I have a "u"
in the column already. So, when SQL compares u
and ü
in order to decide whether the ü is unique, it thinks it is the same as the u
and doesn't insert the ü
.
I changed everything to utf8mb4
because I don't want to worry about character sets and collation anymore. However, it seems that utf8mb4
isn't the solution either when it comes to COMPARING strings.
I also tried this:
SELECT * FROM x WHERE _utf8mb4 'ü' COLLATE utf8mb4_unicode_ci = column
.
This code is executable (looks pretty sophisticated). However, it also returns ü
AND u
.
I have talked to some people in India and here in China about this issue. We haven't found a solution yet.
If anyone could solve the mystery, it would be really great.
Add_On: After reading all the answers and comments below, here is a code sample which solves the problem:
SELECT * FROM x
WHERE 'ü' COLLATE utf8mb4_bin = column
By adding "COLLATE utf8mb4_bin" to the SELECT query, SQL is invited to put the "binary glasses" (ending _bin) on when it looks at the characters in the column. With the binary glasses on, SQL sees now the binary code in the column. And the binary code is different for every letter and character and emoji which one can think of. So, SQL can now also see the difference between u and ü. Therefore, now it only returns the ü when the SELECT query looks for the ü and doesn't also return the u.
In this way, one can leave everything (database collation, table collation) the same, but only add "COLLATE utf8mb4_bin" to a query when exact differentiation is needed.
(Actually, SQL takes all other glasses off (utf8mb4_german_ci, _general_ci, _unicode_ci etc.) and only does what it does when it is not forced to do anything additional. It simply looks at the binary code and doesn't adjust its search to any special cultural background.)
Thanks everybody for the support, especially to Pred.
If you elect to use UTF-8 as your collation, always use utf8mb4 (specifically utf8mb4_unicode_ci). You should not use UTF-8 because MySQL's UTF-8 is different from proper UTF-8 encoding. This is the case because it doesn't offer full unicode support which can lead to data loss or security issues.
utf8mb4_unicode_ci is based on the official Unicode rules for universal sorting and comparison, which sorts accurately in a wide range of languages. utf8mb4_general_ci is a simplified set of sorting rules which aims to do as well as it can while taking many short-cuts designed to improve speed.
Alternatively, if the database already exists, right-click the database that you want and select Properties. Select the Options page, and select a collation from the Collation drop-down list. After you are finished, select OK.
Collation and character set are two different things.
Character set is just an 'unordered' list of characters and their representation.
utf8mb4
is a character set and covers a lots of characters.
Collation defines the order of characters (determines the end result of order by for example) and defines other rules (such as which characters or character combinations should be treated as same). Collations are derived from character sets, there can be more than one collation for the same character set. (It is an extension to the character set - sorta)
In utf8mb4_unicode_ci
all (most?) accented characters are treated as the same character, this is why you get u
and ü
. In short this collation is an accent insensitive collation.
This is similar to the fact that German collations treat ss
and ß
as same.
utf8mb4_bin
is another collation and it treats all characters as different ones. You may or may not want to use it as default, this is up to you and your business rules.
You can also convert the collation in queries, but be aware, that doing so will prevent MySQL to use indexes.
Here is an example using a similar, but maybe a bit more familiar part of collations:
The ci
at the end of the collations means Case Insensitive
and almost all collations with ci
has a pair ending with cs
, meaning Case Sensitive
.
When your column is case insensitive, the where condition column = 'foo'
will find all of these: foo Foo fOo FoO FOo FoO fOO, FOO.
Now if you try to set the collation to case sensitive (utf8mb4_unicode_cs
for example), all the above values are treated as different values.
The localized collations (like German, UK, US, Hungarian, whatever) follow the rules of the named language. In Germany ss
and ß
are the same and this is stated in the rules of the German language. When a German user searches for a value Straße
, they will expect that a software (supporting german language or written in Germany) will return both Straße
and Strasse
.
To go further, when it comes to ordering, the two words are the same, they are equal, their meaning is the same so there is no particular order.
Don't forget, that the UNIQUE
constraint is just a way of ordering/filtering values. So if there is a unique key defined on a column with German collation, it will not allow to insert both Straße
and Strasse
, since by the rules of the language, they should be treated as equal.
Now lets see our original collation: utf8mb4_unicode_ci
, This is a 'universal' collation, which means, that it tries to simplify everything so since ü
is not a really common character and most users have no idea how to type it in, this collation makes it equal to u
. This is a simplification in order to support most of the languages, but as you already know, these kind of simplifications have some side effects. (like in ordering, filtering, using unique constraints, etc).
The utf8mb4_bin
is the other end of the spectrum. This collation is designed to be as strict as it can be. To achieve this, it literally uses the character codes to distinguish characters. This means, each and every form of a character are different, this collation is implicitly case sensitive and accent sensitive.
Both of these have drawbacks: the localized and general collations are designed for one specific language or to provide a common solution. (utf8mb4_unicode_ci
is the 'extension' of the old utf8_general_ci
collation)
The binary requires extra caution when it comes to user interaction. Since it is CS
and AS
it can confuse users who are used to get the value 'Foo' when they are looking for the value 'foo'. Also as a developer, you have to be extra cautious when it comes to joins and other features. The INNER JOIN 'foo' = 'Foo' will return nothing, since 'foo' is not equal to 'Foo'.
I hope that these examples and explanation helps a bit.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With