In the function mb_detect_encoding there is a parameter for strict mode.
In the first, most upvoted comment:
<?php
$str = 'áéóú'; // ISO-8859-1
mb_detect_encoding($str, 'UTF-8'); // 'UTF-8'
mb_detect_encoding($str, 'UTF-8', true); // false
This is true, yes. But can anybody explain why?
Everything in this answer is based on my reading of the code here and here.
I did not write it and I did not step through it with a debugger; this is my interpretation only.
It seems that the intention was for strict mode to check if the string as a whole was valid for the encoding, while non-strict mode would allow for a sub-sequence that could be part of a valid string. For example, if the string ended with what should be the first byte of a multi-byte character it would not match in strict mode but would still qualify as UTF-8 under non-strict mode.
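The difference between a complete string and a truncated one can be shown with a whole-string check. This is only a sketch: it uses mb_check_encoding() rather than mb_detect_encoding(), so that the result does not depend on the detection behaviour discussed here, and the two sample strings are my own.

```php
<?php
// Whole-string validation, which is what strict mode appears intended to do.
$complete  = "caf\xc3\xa9"; // "café": 0xC3 0xA9 is the full UTF-8 encoding of "é"
$truncated = "caf\xc3";     // ends with only the lead byte of a two-byte character

var_dump(mb_check_encoding($complete, 'UTF-8'));  // bool(true)
var_dump(mb_check_encoding($truncated, 'UTF-8')); // bool(false)
```

The truncated string could still be the start of valid UTF-8, which is the case non-strict mode seems meant to tolerate.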
However there seems to be a bug* where in non-strict mode only the first byte of the string is being checked in some circumstances.
Example:
The byte 0xf8 is not allowed anywhere in UTF-8. When placed at the start of a string, mb_detect_encoding() properly returns false for it regardless of which mode is used.
$str = "\xf8foo";
var_dump(
mb_detect_encoding($str, 'UTF-8'), // bool(false)
mb_detect_encoding($str, 'UTF-8', true) // bool(false)
);
But as long as the first byte of the string is one that may occur somewhere in a UTF-8 sequence, non-strict mode returns UTF-8, even when an invalid byte follows later.
$str = "foo\xf8";
var_dump(
mb_detect_encoding($str, 'UTF-8'), // string(5) "UTF-8"
mb_detect_encoding($str, 'UTF-8', true) // bool(false)
);
So while your ISO-8859-1 string 'áéóú' is not valid UTF-8, the first byte "\xe1" can occur in UTF-8, and mb_detect_encoding() mistakenly reports the string as UTF-8.
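To see why "\xe1" passes a first-byte check while "\xf8" does not, it helps to classify bytes by the role they can play in UTF-8. The helper below is hypothetical (utf8_byte_role() is my own name, not part of mbstring) and only looks at one byte at a time; it assumes its argument is in the range 0 to 255.

```php
<?php
// Hypothetical helper: classify a single byte (0-255) by its possible UTF-8 role.
function utf8_byte_role(int $byte): string
{
    if ($byte <= 0x7F) {
        return 'ascii';        // 0x00-0x7F: single-byte character
    }
    if ($byte <= 0xBF) {
        return 'continuation'; // 0x80-0xBF: second or later byte of a sequence
    }
    if ($byte <= 0xC1 || $byte >= 0xF5) {
        return 'invalid';      // 0xC0, 0xC1, 0xF5-0xFF: never valid in UTF-8
    }
    if ($byte <= 0xDF) {
        return 'lead-2';       // 0xC2-0xDF: starts a two-byte sequence
    }
    if ($byte <= 0xEF) {
        return 'lead-3';       // 0xE0-0xEF: starts a three-byte sequence
    }
    return 'lead-4';           // 0xF0-0xF4: starts a four-byte sequence
}

// Print the role of every byte in the two strings discussed above.
foreach (["\xe1\xe9\xf3\xfa", "foo\xf8"] as $sample) {
    foreach (str_split($sample) as $char) {
        printf("%02x: %s\n", ord($char), utf8_byte_role(ord($char)));
    }
}
```

0xe1 is a legitimate lead byte, so a check that stops after the first byte has no reason to reject it; the string is only invalid because the bytes that follow are not continuation bytes.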
*I've opened a report for this at https://bugs.php.net/bug.php?id=72933
áéóú in ISO-8859-1 encodes as:
e1 e9 f3 fa
If you misinterpret it as UTF-8 you only get four invalid byte sequences. The Multi-Byte extension is basically designed to ignore errors. For instance, mb_convert_encoding() will replace those sequences with question marks or whatever you set with mb_substitute_character().
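A short sketch of that behaviour, assuming the mbstring extension is loaded. The variable names are mine, and the substitute character is set explicitly to "?" so the result does not depend on local configuration. The exact number of substituted characters may vary between PHP versions, so no output is shown for the last call.

```php
<?php
$latin1 = "\xe1\xe9\xf3\xfa"; // "áéóú" encoded as ISO-8859-1

// With the correct source encoding, the conversion is clean.
$good = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
var_dump($good === "\xc3\xa1\xc3\xa9\xc3\xb3\xc3\xba"); // bool(true)

// With the wrong source encoding, no error is raised:
// the invalid sequences are replaced by the substitute character.
$previous = mb_substitute_character();  // remember the current setting
mb_substitute_character(0x3F);          // code point of "?"
$bad = mb_convert_encoding($latin1, 'UTF-8', 'UTF-8');
mb_substitute_character($previous);     // restore the previous setting
var_dump($bad);
```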
My educated guess is that strict encoding determines what should be done with invalid byte sequences:
false means to remove them
true means to keep them
If you ignore these invalid sequences you're obviously discarding extremely valuable information and you only get sensible results in very limited circumstances, e.g.
$str = chr(81);
var_dump( mb_detect_encoding($str, ['ISO-8859-1', 'Windows-1252']) );
var_dump( mb_detect_encoding($str, ['Windows-1252', 'ISO-8859-1']) );
To sum up, mb_detect_encoding() is in general not as useful as you may think, and it's total crap with the default parameters.