utf8mb4_unicode_ci vs utf8mb4_bin

Tags: php, mysql, character-encoding, utf-8

So first let's see if I get it right:

A charset is a set of symbols and encodings. A collation is a set of rules for comparing characters in a charset.

I should use utf8mb4 because MySQL's utf8 is a fraud: it stores at most 3 bytes per character, not the true up-to-4-byte UTF-8 that PHP, for example, uses.

So utf8mb4 is a charset, and utf8mb4_unicode_ci and utf8mb4_bin are two of its many available collations.

utf8_unicode_ci does case-insensitive comparison and other special comparisons (I heard it messes up all the accents in French, for example). utf8_bin is case-sensitive because it compares the binary values of the characters.

Now the questions:

1. If, for example, I want to allow case-sensitive login names using utf8mb4_unicode_ci, I will have to do things like:
```
SELECT name FROM `table` WHERE BINARY name = 'MyNaMEiSFUlloFUPPERCases';
```
2. If, for example, I want to allow case-insensitive search using utf8mb4_bin, I will have to do things like:
```
SELECT name FROM `table` WHERE LOWER(name) LIKE '%myname%';
```
3. So which one is better? What about the bad things I hear about utf8_unicode_ci and accents/other special characters?

Thank you :)

350

asked May 21 '16 15:05

shrimpdrake

1 Answer

Did you "get things right"? Yes, except that I think French accents are 'correctly' compared in utf8mb4_unicode_520_ci.

Your two SELECTs will both do a full table scan and will therefore be inefficient. The reason is that you are overriding the collation (for #1), or hiding the column in a function (LOWER, for #2), or using a leading wildcard (LIKE '%...').
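One way to check this yourself is EXPLAIN. The sketch below assumes a hypothetical `users` table with an index on `name` (neither is from the question). In the output, a `type` of `ALL` means every row is read, and `index` means the whole index is walked; either way it is not a direct lookup.

```sql
-- Hypothetical table `users` with an index on `name`.

-- #1: BINARY overrides the column's collation.
EXPLAIN SELECT name FROM users WHERE BINARY name = 'MyNaMEiSFUlloFUPPERCases';

-- #2: the column is hidden inside LOWER() and the pattern has a leading wildcard.
EXPLAIN SELECT name FROM users WHERE LOWER(name) LIKE '%myname%';
```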

If you want it to be efficient, declare name with COLLATE utf8mb4_bin and simply do WHERE name = ...
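A minimal sketch of such a declaration, assuming a hypothetical `users` table; the table name, column length and index name are illustrative only, and only the COLLATE clause on `name` matters here:

```sql
-- Hypothetical table: `name` is stored and compared with the binary collation.
CREATE TABLE users (
    id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(100) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL,
    UNIQUE KEY uq_users_name (name)
) ENGINE=InnoDB;

-- Plain equality: case-sensitive because of the column's collation,
-- and able to use the index on `name`.
SELECT name FROM users WHERE name = 'MyNaMEiSFUlloFUPPERCases';
```

For an existing table, something like `ALTER TABLE users MODIFY name VARCHAR(100) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL;` should have the same effect, though on a large table it may take a while to rebuild.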

Do you think some of these equivalences and orderings are 'incorrect' for French?

A=a=ª=À=Á=Â=Ã=Ä=Å=à=á=â=ã=ä=å=Ā=ā=Ą=ą  Aa  ae=Æ=æ  az  B=b  C=c=Ç=ç=Ć=ć=Č=č  ch  cz D=d=Ð=ð=Ď=ď  dz  E=e=È=É=Ê=Ë=è=é=ê=ë=Ē=ē=Ĕ=ĕ=Ė=ė=Ę=ę=Ě=ě  F=f  fz  ƒ  G=g=Ğ=ğ=Ģ=ģ gz  H=h  hz  I=i=Ì=Í=Î=Ï=ì=í=î=ï=Ī=ī=Į=į=İ  ij=ĳ  iz  ı  J=j  K=k=Ķ=ķ L=l=Ĺ=ĺ=Ļ=ļ=Ł=ł  lj=Ǉ=ǈ=ǉ  ll  lz  M=m  N=n=Ñ=ñ=Ń=ń=Ņ=ņ=Ň=ň  nz O=o=º=Ò=Ó=Ô=Õ=Ö=Ø=ò=ó=ô=õ=ö=ø  oe=Œ=œ  oz  P=p  Q=q  R=r=Ř=ř  S=s=Ś=ś=Ş=ş=Š=š  sh ss=ß  sz  T=t=Ť=ť  TM=tm=™  tz  U=u=Ù=Ú=Û=Ü=ù=ú=û=ü=Ū=ū=Ů=ů=Ų=ų  ue  uz  V=v  W=w  X=x Y=y=Ý=ý=ÿ=Ÿ  yz  Z=z=Ź=ź=Ż=ż=Ž=ž  zh  zz  Þ=þ  µ

More utf8 collations. 8.0 and utf8mb4 collations.

The "520" (newer) version differs by not treating Æ, Ð, Ł, and Ø as separate 'letters', and perhaps in other ways.
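If you would rather test a particular pair of characters than read the chart, you can apply a collation to a literal directly. This is only a sketch; it assumes your client really sends UTF-8 and that all three collations are available on your server version.

```sql
-- Make sure the literals below are interpreted as utf8mb4.
SET NAMES utf8mb4;

-- Each column is 1 if the two strings compare equal under that collation, else 0.
SELECT
    'a' = 'A' COLLATE utf8mb4_bin            AS bin_case,
    'a' = 'A' COLLATE utf8mb4_unicode_ci     AS ci_case,
    'e' = 'é' COLLATE utf8mb4_unicode_ci     AS ci_accent,
    'O' = 'Ø' COLLATE utf8mb4_unicode_ci     AS old_o_slash,
    'O' = 'Ø' COLLATE utf8mb4_unicode_520_ci AS new_o_slash;
```

The first column should come back 0 and the second 1. The last two are the kind of pair where the older and "520" collations can be expected to disagree.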

151

answered Oct 08 '22 09:10

Rick James
