
Understanding results of PHP's mb_detect_encoding and mb_check_encoding functions

I'm trying to understand the logic of the two functions mb_detect_encoding and mb_check_encoding, but the documentation is poor. I'm starting with a very simple test string:

$string = "\x65\x92";

This is a lowercase 'e' followed by a curly quote mark (the right single quotation mark) when using Windows-1252 encoding.

I get the following results:

mb_detect_encoding($string,"Windows-1252"); // false
mb_check_encoding($string,"Windows-1252"); // true
mb_detect_encoding($string,"ISO-8859-1"); // ISO-8859-1
mb_check_encoding($string,"ISO-8859-1"); // true
mb_detect_encoding($string,"UTF-8",true); // false
mb_detect_encoding($string,"UTF-8"); // UTF-8
mb_check_encoding($string,"UTF-8"); // false
  • I don't understand why mb_detect_encoding gives "ISO-8859-1" for the string but not "Windows-1252", when, according to https://en.wikipedia.org/wiki/ISO/IEC_8859-1 and https://en.wikipedia.org/wiki/Windows-1252, the byte x92 is defined in the Windows-1252 character encoding but not in ISO-8859-1.

  • Secondly, I don't understand how mb_detect_encoding can return false, but mb_check_encoding can return true for the same string and same character encoding.

  • Finally, I don't understand why the string can ever be detected as UTF-8, strict mode or not. The byte x92 is a continuation byte in UTF-8, but in this string, it's following a valid character byte, not a leading byte for a sequence.
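The byte-level reasoning in the last point can be checked directly. The following is only a minimal sketch: utf8ByteRole is a hypothetical helper written for illustration, not part of mbstring, and it looks at each byte in isolation rather than validating whole sequences.

```php
<?php
// Hypothetical helper: classify a single byte by its role in UTF-8.
function utf8ByteRole(int $byte): string
{
    if ($byte <= 0x7F) {
        return 'ascii';            // 0xxxxxxx: a complete single-byte character
    }
    if (($byte & 0xC0) === 0x80) {
        return 'continuation';     // 10xxxxxx: only valid after a lead byte
    }
    if ($byte >= 0xC2 && $byte <= 0xF4) {
        return 'lead';             // starts a 2-, 3- or 4-byte sequence
    }
    return 'invalid';              // 0xC0, 0xC1 and 0xF5..0xFF never occur in UTF-8
}

$string = "\x65\x92";
foreach (str_split($string) as $char) {
    printf("0x%02X: %s\n", ord($char), utf8ByteRole(ord($char)));
}
// 0x65: ascii
// 0x92: continuation
```

A continuation byte directly after an ASCII byte has no lead byte to attach to, which is why the string is not well-formed UTF-8.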

asked Nov 08 '22 07:11 by Dom


1 Answer

Your examples do a good job of showing why mb_detect_encoding should be used sparingly: it is not intuitive and sometimes logically wrong. If it must be used, always pass true as the third parameter (strict), so that non-UTF-8 strings don't get reported as UTF-8.
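As a small illustration of the strict flag, here is a sketch using the test string from the question. The variable names are made up for the example, and the non-strict call is left out on purpose because its result is exactly the unintuitive part.

```php
<?php
// Test string from the question: 'e' followed by the byte 0x92.
$string = "\x65\x92";

// Strict mode: the string is not valid UTF-8, so nothing in the list matches.
$strictResult = mb_detect_encoding($string, 'UTF-8', true);
var_dump($strictResult); // bool(false), the same result the question reports

// A well-formed UTF-8 string ("café") is still detected in strict mode.
$validResult = mb_detect_encoding("caf\xC3\xA9", 'UTF-8', true);
var_dump($validResult); // string(5) "UTF-8"
```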

It's a bit more reliable to run mb_check_encoding over an array of desired encodings, in order of likelihood/priority. For example:

$encodings = [
    'UTF-8',
    'Windows-1252',
    'SJIS',
    'ISO-8859-1',
];

$encoding = 'UTF-8'; // fallback if no candidate validates
$string = 'foo';
foreach ($encodings as $candidate) {
    if (mb_check_encoding($string, $candidate)) {
        // We'll assume the encoding is $candidate since the string is valid in it
        $encoding = $candidate;
        break;
    }
}

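The ordering depends on your priorities, though. The first encoding that validates wins, so permissive single-byte encodings such as ISO-8859-1, which accept almost any byte sequence, are best placed near the end of the list.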
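If this check is needed in more than one place, it could be wrapped in a function. This is only a sketch: firstValidEncoding is a hypothetical name, the candidate list is an assumption you would adapt, and the conversion step with mb_convert_encoding is an optional follow-up rather than part of the check itself.

```php
<?php
// Hypothetical helper: return the first candidate encoding in which
// $string is valid, or null if none of the candidates validates.
function firstValidEncoding(string $string, array $candidates): ?string
{
    foreach ($candidates as $candidate) {
        if (mb_check_encoding($string, $candidate)) {
            return $candidate;
        }
    }
    return null;
}

$string = "\x65\x92"; // test string from the question
$encoding = firstValidEncoding($string, ['UTF-8', 'Windows-1252', 'ISO-8859-1']);

$utf8 = null;
if ($encoding !== null) {
    // Normalise to UTF-8 once a plausible source encoding is known.
    $utf8 = mb_convert_encoding($string, 'UTF-8', $encoding);
}
```

For the question's string, UTF-8 is rejected and Windows-1252 is the first candidate that validates, so the curly quote survives the conversion to UTF-8.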

answered Nov 15 '22 11:11 by Michael Butler