So first let's see if I get it right:
A charset is a set of symbols and encodings. A collation is a set of rules for comparing characters in a charset.
I should use utf8mb4 because MySQL's utf8 is a fraud: it stores at most 3 bytes per character, whereas real UTF-8 (as used in PHP, for example) needs up to 4.
As such, utf8mb4 is a charset, and utf8mb4_unicode_ci and utf8mb4_bin are two of its many available collations.
utf8_unicode_ci does case-insensitive comparison and other special comparisons (I heard it messes up all the accents in French, for example). utf8_bin is case-sensitive because it compares the binary values of the characters.
Now the questions:
If, for example, I want to allow case-sensitive login names using utf8mb4_unicode_ci, I will have to do things like:
SELECT name FROM table WHERE BINARY name = 'MyNaMEiSFUlloFUPPERCases';
If, for example, I want to allow case-insensitive search using utf8mb4_bin, I will have to do things like:
SELECT name FROM table WHERE LOWER(name) LIKE '%myname%';
So which one is better? What about the bad things I hear about utf8_unicode_ci and accents/other special characters?
Thank you :)
Key differences: utf8mb4_unicode_ci is based on the official Unicode rules for universal sorting and comparison, which sorts accurately in a wide range of languages. utf8mb4_general_ci is a simplified set of sorting rules which aims to do as well as it can while taking many shortcuts designed to improve speed.
Next in the list of "better" collations for general use (as opposed to Spanish-specific, etc.) is utf8mb4_unicode_ci. This matches the Unicode Collation Algorithm version 4.0, written several years ago.
utf8_unicode_ci uses the standard Unicode Collation Algorithm and supports so-called expansions and ligatures. For example, the German letter ß (U+00DF LETTER SHARP S) is sorted near "ss", and the letter Œ (U+0152 LATIN CAPITAL LIGATURE OE) is sorted near "OE".
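The expansion behaviour described here can be checked directly on a server. The following is a minimal sketch, assuming a MySQL connection that accepts utf8mb4 literals; the column aliases are made up for illustration. Each expression returns 1 when the two strings compare as equal under the named collation and 0 otherwise.

```sql
-- Make sure string literals on this connection are utf8mb4,
-- otherwise the explicit COLLATE clauses below are rejected.
SET NAMES utf8mb4;

SELECT
    -- Expansion: does ß compare equal to the two-letter string "ss"?
    'ß' = 'ss' COLLATE utf8mb4_unicode_ci AS sharp_s_unicode_ci,
    'ß' = 'ss' COLLATE utf8mb4_general_ci AS sharp_s_general_ci,
    -- Ligature: does Œ compare equal to "OE"?
    'Œ' = 'OE' COLLATE utf8mb4_unicode_ci AS oe_unicode_ci,
    -- Binary collation compares the encoded values only.
    'é' = 'e'  COLLATE utf8mb4_bin        AS accent_bin;
```

Going by the description above, the utf8mb4_unicode_ci columns should come back as equal. Running the query on your own version is the reliable way to confirm it.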
Did you "get things right"? Yes, except that I think French accents are 'correctly' compared in utf8mb4_unicode_520_ci.
Your two SELECTs will both do a full table scan, and will therefore be inefficient. The reason is that you are overriding the collation (for #1), or hiding the column in a function (LOWER, for #2), or using a leading wildcard (LIKE '%...').
If you want it to be efficient, declare name to be COLLATE utf8mb4_bin and simply do WHERE name = ... .
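The advice above can be sketched as follows. The table name users, the id column, the VARCHAR length and the index name are all hypothetical. The sketch also assumes an index on name, which the lookup needs in order to avoid a scan.

```sql
-- Hypothetical schema: the collation is declared on the column itself,
-- so queries do not need BINARY or a COLLATE override.
CREATE TABLE users (
    id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(64) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL,
    UNIQUE KEY uq_users_name (name)  -- assumed index; also enforces unique logins
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

-- Case-sensitive lookup: the column is compared as-is, with no function
-- wrapped around it and no leading wildcard, so the index can be used.
SELECT name FROM users WHERE name = 'MyNaMEiSFUlloFUPPERCases';

-- EXPLAIN reports whether the optimizer actually picks the index.
EXPLAIN SELECT name FROM users WHERE name = 'MyNaMEiSFUlloFUPPERCases';
```

With this declaration, 'myname' and 'MyName' are distinct values, which is what case-sensitive login names require.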
Do you think some of these equivalences and orderings are 'incorrect' for French?
A=a=ª=À=Á=Â=Ã=Ä=Å=à=á=â=ã=ä=å=Ā=ā=Ą=ą Aa ae=Æ=æ az B=b C=c=Ç=ç=Ć=ć=Č=č ch cz D=d=Ð=ð=Ď=ď dz E=e=È=É=Ê=Ë=è=é=ê=ë=Ē=ē=Ĕ=ĕ=Ė=ė=Ę=ę=Ě=ě F=f fz ƒ G=g=Ğ=ğ=Ģ=ģ gz H=h hz I=i=Ì=Í=Î=Ï=ì=í=î=ï=Ī=ī=Į=į=İ ij=ij iz ı J=j K=k=Ķ=ķ L=l=Ĺ=ĺ=Ļ=ļ=Ł=ł lj=LJ=Lj=lj ll lz M=m N=n=Ñ=ñ=Ń=ń=Ņ=ņ=Ň=ň nz O=o=º=Ò=Ó=Ô=Õ=Ö=Ø=ò=ó=ô=õ=ö=ø oe=Œ=œ oz P=p Q=q R=r=Ř=ř S=s=Ś=ś=Ş=ş=Š=š sh ss=ß sz T=t=Ť=ť TM=tm=™ tz U=u=Ù=Ú=Û=Ü=ù=ú=û=ü=Ū=ū=Ů=ů=Ų=ų ue uz V=v W=w X=x Y=y=Ý=ý=ÿ=Ÿ yz Z=z=Ź=ź=Ż=ż=Ž=ž zh zz Þ=þ µ
More utf8 collations. 8.0 and utf8mb4 collations.
The "520" (newer) version differs by not treating Æ, Ð, Ł, and Ø as separate 'letters', and perhaps in other ways.
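To see how the two versions treat these letters on a given server, a comparison like the one below can help. It is only a sketch: the aliases are invented, and it assumes the server ships utf8mb4_unicode_520_ci. A result of 1 means the pair compares as equal, 0 means it does not.

```sql
-- Literals must be utf8mb4 for the explicit COLLATE clauses to apply.
SET NAMES utf8mb4;

SELECT
    'Ø' = 'O' COLLATE utf8mb4_unicode_ci     AS o_slash_unicode_ci,
    'Ø' = 'O' COLLATE utf8mb4_unicode_520_ci AS o_slash_unicode_520_ci,
    'Ł' = 'L' COLLATE utf8mb4_unicode_ci     AS l_stroke_unicode_ci,
    'Ł' = 'L' COLLATE utf8mb4_unicode_520_ci AS l_stroke_unicode_520_ci;
```

The same pattern works for any pair in the equivalence list, including the French accented letters.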