
What is the purpose of the MB_CASE_*_SIMPLE constants?

Tags: php, mbstring

According to the manual, the following constants were added in PHP 7.3:

  • MB_CASE_FOLD
  • MB_CASE_LOWER_SIMPLE
  • MB_CASE_UPPER_SIMPLE
  • MB_CASE_TITLE_SIMPLE
  • MB_CASE_FOLD_SIMPLE

I found an example of what MB_CASE_FOLD does:

echo mb_convert_case('ẞ', MB_CASE_FOLD, 'UTF-8'); // ss

However, I could not find any reference to what the MB_CASE_*_SIMPLE constants do.

At first glance, with simple latin1 characters, MB_CASE_LOWER_SIMPLE behaves just like MB_CASE_LOWER.

What do the MB_CASE_*_SIMPLE constants do differently from their MB_CASE_* counterparts?

BenMorel asked Nov 14 '19 14:11


2 Answers

We can find the corresponding C implementation at https://github.com/php/php-src/blob/master/ext/mbstring/php_unicode.c#L223

The git commit message that introduced these constants explains:

  • Full case folding is implemented, but case-insensitive mb_* operations continue to use simple case folding. The reason is that full case folding of the haystack string may change the position at which a match occurred. This would have to be mapped back into the position in the original string.

  • mb_convert_case() exposes both the full and the simple case mapping / folding, where full is the default. The constants are:

    • MB_CASE_LOWER (used by mb_strtolower)
    • MB_CASE_UPPER (used by mb_strtoupper)
    • MB_CASE_TITLE
    • MB_CASE_FOLD
    • MB_CASE_LOWER_SIMPLE
    • MB_CASE_UPPER_SIMPLE
    • MB_CASE_TITLE_SIMPLE
    • MB_CASE_FOLD_SIMPLE (used by case-insensitive operations)
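The position problem mentioned in the commit message can be illustrated with a small sketch. It assumes PHP 7.3 or later with the mbstring extension and a UTF-8 source file; the sample string 'Straße 1' is only an illustration and does not come from the commit.

```php
<?php
// Sketch: full case folding can shift character offsets, simple folding cannot.
$haystack = 'Straße 1';

// Full folding expands "ß" to "ss", so the folded string is one character longer.
$full = mb_convert_case($haystack, MB_CASE_FOLD, 'UTF-8');

// Simple folding maps every code point to exactly one code point.
$simple = mb_convert_case($haystack, MB_CASE_FOLD_SIMPLE, 'UTF-8');

$posOriginal = mb_strpos($haystack, '1', 0, 'UTF-8');
$posFull     = mb_strpos($full, '1', 0, 'UTF-8');
$posSimple   = mb_strpos($simple, '1', 0, 'UTF-8');

echo $full, ' => ', $posFull, PHP_EOL;       // strasse 1 => 8
echo $simple, ' => ', $posSimple, PHP_EOL;   // straße 1 => 7
echo 'original => ', $posOriginal, PHP_EOL;  // original => 7
```

A match found in the fully folded string sits at offset 8, which no longer corresponds to offset 7 in the original. This is consistent with the commit's stated reason for keeping simple folding in the case-insensitive mb_* functions.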

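So the constants with the _SIMPLE suffix use Unicode's simple case mapping and folding, and those without the suffix use the full versions.

The difference is that a simple mapping always replaces one code point with exactly one code point, whereas a full mapping may replace one code point with several (for example, "ß" uppercases to "SS").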

Chris Lam answered Nov 04 '22 22:11


Here are some examples where it matters:

MB_CASE_UPPER_SIMPLE:

mb_convert_case("ß", MB_CASE_UPPER_SIMPLE); // "ß"
mb_convert_case("ß", MB_CASE_UPPER); // "SS"

MB_CASE_LOWER_SIMPLE:

mb_convert_case("İ", MB_CASE_LOWER_SIMPLE); // "i"
mb_convert_case("İ", MB_CASE_LOWER); // "i\xcc\x87" (i followed by U+0307, combining dot above)

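MB_CASE_TITLE_SIMPLE relates to MB_CASE_UPPER_SIMPLE in the same way that MB_CASE_TITLE relates to MB_CASE_UPPER: it converts the first letter of each word, but with the simple one-to-one mapping rather than the full one.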
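A minimal sketch of the title-case variants, assuming PHP 7.3 or later with mbstring and a UTF-8 source file. The input 'ßa' is a made-up word chosen only because "ß" has a multi-character title-case mapping.

```php
<?php
// Sketch: compare full and simple title-casing on a word that starts with "ß".
$word = 'ßa';

// Full mapping: "ß" has the special title-case form "Ss" (two code points).
$full = mb_convert_case($word, MB_CASE_TITLE, 'UTF-8');

// Simple mapping: "ß" has no one-to-one title-case form, so it stays as-is.
$simple = mb_convert_case($word, MB_CASE_TITLE_SIMPLE, 'UTF-8');

echo $full, ' (', mb_strlen($full, 'UTF-8'), ')', PHP_EOL;     // Ssa (3)
echo $simple, ' (', mb_strlen($simple, 'UTF-8'), ')', PHP_EOL; // ßa (2)
```

As with the upper-case example, only the simple variant keeps the character count of the input.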

Anonymous answered Nov 04 '22 20:11