PHP function mb_detect_encoding strict mode

In the function mb_detect_encoding there is a parameter for strict mode.

The first, most upvoted comment gives this example:

<?php
$str = 'áéóú'; // ISO-8859-1
mb_detect_encoding($str, 'UTF-8'); // 'UTF-8'
mb_detect_encoding($str, 'UTF-8', true); // false

This is indeed what happens. But can anybody explain why?

vaso123 asked Aug 24 '16 07:08


2 Answers

Everything in this answer is based on my reading of the code here and here.

I did not write it and I did not step through it with a debugger; this is my interpretation only.


It seems that the intention was for strict mode to check if the string as a whole was valid for the encoding, while non-strict mode would allow for a sub-sequence that could be part of a valid string. For example, if the string ended with what should be the first byte of a multi-byte character it would not match in strict mode but would still qualify as UTF-8 under non-strict mode.
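The distinction described above can be illustrated with mb_check_encoding(), which validates a string as a whole. This is only a sketch with made-up inputs; it shows what "valid as a whole" versus "valid up to an incomplete trailing character" means, not how mb_detect_encoding() is implemented internally.

```php
<?php
// Hypothetical inputs: "\xc3\xa1" is the UTF-8 encoding of "á".
$complete  = "foo\xc3\xa1"; // ends with a complete two-byte character
$truncated = "foo\xc3";     // ends with only the lead byte of that character

// mb_check_encoding() validates the string as a whole.
var_dump(mb_check_encoding($complete, 'UTF-8'));  // bool(true)
var_dump(mb_check_encoding($truncated, 'UTF-8')); // bool(false)

// Dropping the dangling lead byte leaves a valid string, which is the kind of
// input the lenient mode appears intended to tolerate.
var_dump(mb_check_encoding(substr($truncated, 0, -1), 'UTF-8')); // bool(true)
```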

However, there seems to be a bug* where, in some circumstances, non-strict mode checks only the first byte of the string.

Example:

The byte 0xf8 is not allowed anywhere in UTF-8. When it is placed at the start of a string, mb_detect_encoding() properly returns false regardless of which mode is used.

$str = "\xf8foo";

var_dump(
    mb_detect_encoding($str, 'UTF-8'),      // bool(false)
    mb_detect_encoding($str, 'UTF-8', true) // bool(false)
);

But as long as the first byte of the string is one that may occur somewhere in a UTF-8 sequence, non-strict mode returns UTF-8, even when an invalid byte follows later.

$str = "foo\xf8";

var_dump(
    mb_detect_encoding($str, 'UTF-8'),      // string(5) "UTF-8"
    mb_detect_encoding($str, 'UTF-8', true) // bool(false)
);

So while your ISO-8859-1 string 'áéóú' is not valid UTF-8, its first byte "\xe1" can occur in UTF-8, and mb_detect_encoding() mistakenly reports the string as UTF-8.
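If whole-string validity is what you actually need, one way to sidestep the question of modes is to validate each candidate yourself. The following is a minimal sketch; firstValidEncoding() is a hypothetical helper written for this answer, not an mbstring function, and it relies only on mb_check_encoding().

```php
<?php
// Hypothetical helper (not part of mbstring): return the first candidate
// encoding for which the whole string validates, or null if none does.
function firstValidEncoding(string $str, array $candidates): ?string
{
    foreach ($candidates as $encoding) {
        if (mb_check_encoding($str, $encoding)) {
            return $encoding;
        }
    }
    return null;
}

$latin1 = "\xe1\xe9\xf3\xfa"; // 'áéóú' as ISO-8859-1 bytes

var_dump(firstValidEncoding($latin1, ['UTF-8']));               // NULL
var_dump(firstValidEncoding($latin1, ['UTF-8', 'ISO-8859-1'])); // string(10) "ISO-8859-1"
```

The order of the candidates matters: every byte sequence validates as ISO-8859-1, so it only makes sense as a last resort after stricter encodings such as UTF-8.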


*I've opened a report for this at https://bugs.php.net/bug.php?id=72933

user3942918 answered Sep 18 '22 13:09


áéóú in ISO-8859-1 encodes as:

e1 e9 f3 fa

If you misinterpret it as UTF-8, you get nothing but four invalid byte sequences. The Multi-Byte extension is basically designed to ignore errors. For instance, mb_convert_encoding() will replace those sequences with question marks or whatever you set with mb_substitute_character().
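A short sketch of the two points above: the byte values of the ISO-8859-1 string, and what mb_convert_encoding() does when those bytes are treated as UTF-8. The variable names are only illustrative, and the exact number of substitute characters produced is deliberately not shown because it may differ between PHP versions.

```php
<?php
// Build the string from code points so the result does not depend on the
// encoding this file is saved in ("\u{...}" always yields UTF-8 in PHP 7+).
$utf8   = "\u{e1}\u{e9}\u{f3}\u{fa}"; // 'áéóú'
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');

echo bin2hex($latin1), "\n"; // e1e9f3fa

// Treat those bytes as UTF-8 anyway: invalid sequences are replaced with the
// substitute character instead of raising an error.
mb_substitute_character(0x3F); // '?', the default, set explicitly here
$scrubbed = mb_convert_encoding($latin1, 'UTF-8', 'UTF-8');

var_dump($scrubbed);                             // question marks only
var_dump(mb_check_encoding($scrubbed, 'UTF-8')); // bool(true)
```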

My educated guess is that strict mode determines what should be done with invalid byte sequences:

  • false means to remove them
  • true means to keep them

If you ignore these invalid sequences you're obviously discarding extremely valuable information and you only get sensible results in very limited circumstances, e.g.

$str = chr(81);
var_dump( mb_detect_encoding($str, ['ISO-8859-1', 'Windows-1252']) );
var_dump( mb_detect_encoding($str, ['Windows-1252', 'ISO-8859-1']) );

To sum up, mb_detect_encoding() is in general not as useful as you may think, and it's total crap with the default parameters.

Álvaro González answered Sep 21 '22 13:09