PHP function mb_detect_encoding strict mode

Question

In the function mb_detect_encoding there is a parameter for strict mode.

The first, most upvoted comment gives this example:

<?php
$str = 'áéóú'; // ISO-8859-1
mb_detect_encoding($str, 'UTF-8'); // 'UTF-8'
mb_detect_encoding($str, 'UTF-8', true); // false

This is true, yes. But can anybody explain why?

user3942918 · Accepted Answer

Everything in this answer is based on my reading of the code here and here.

I did not write it and I did not step through it with a debugger; this is my interpretation only.

It seems that the intention was for strict mode to check if the string as a whole was valid for the encoding, while non-strict mode would allow for a sub-sequence that could be part of a valid string. For example, if the string ended with what should be the first byte of a multi-byte character it would not match in strict mode but would still qualify as UTF-8 under non-strict mode.
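The truncated-sequence case described above can be tried with a small sketch like the one below. The sample strings are my own and are written as explicit bytes so the result does not depend on how the file is saved. Only the strict-mode results are stated; the non-strict result is printed without a claim, since it may differ between PHP versions.

```php
<?php
// "é" in UTF-8 is the two bytes C3 A9.
$complete  = "caf\xc3\xa9"; // valid UTF-8
$truncated = "caf\xc3";     // ends with a lead byte whose continuation byte is missing

// Strict mode: the string as a whole has to be valid for the encoding.
$strictComplete  = mb_detect_encoding($complete, 'UTF-8', true);  // "UTF-8"
$strictTruncated = mb_detect_encoding($truncated, 'UTF-8', true); // false

// Non-strict mode: no expected value given here, run it to see what
// your PHP version reports.
$looseTruncated = mb_detect_encoding($truncated, 'UTF-8', false);

var_dump($strictComplete, $strictTruncated, $looseTruncated);
```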

However, there seems to be a bug* where, in some circumstances, non-strict mode only checks the first byte of the string.

Example:

The byte 0xf8 is not allowed anywhere in UTF-8. When placed at the start of a string mb_detect_encoding() properly returns false for it regardless of which mode is used.

$str = "\xf8foo";

var_dump(
    mb_detect_encoding($str, 'UTF-8'),      // bool(false)
    mb_detect_encoding($str, 'UTF-8', true) // bool(false)
);

But as long as the first byte of the string is one that can occur somewhere in a UTF-8 sequence, non-strict mode returns UTF-8, even when an invalid byte follows later.

$str = "foo\xf8";

var_dump(
    mb_detect_encoding($str, 'UTF-8'),      // string(5) "UTF-8"
    mb_detect_encoding($str, 'UTF-8', true) // bool(false)
);

So while your ISO-8859-1 string 'áéóú' is not valid UTF-8, its first byte "\xe1" can occur in UTF-8, and mb_detect_encoding() mistakenly reports the string as UTF-8.
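If the goal is simply to know whether a whole string is valid UTF-8, one option that does not depend on where the bad byte sits is mb_check_encoding(). A minimal sketch follows; the helper name isValidUtf8 and the sample labels are hypothetical, not something from the mbstring source.

```php
<?php
// Hypothetical helper (not part of mbstring): true only when every byte
// sequence in $bytes is well-formed UTF-8.
function isValidUtf8(string $bytes): bool
{
    return mb_check_encoding($bytes, 'UTF-8');
}

$samples = [
    'invalid byte first' => "\xf8foo",
    'invalid byte last'  => "foo\xf8",
    'ISO-8859-1 bytes'   => "\xe1\xe9\xf3\xfa",                 // áéóú as ISO-8859-1
    'UTF-8 bytes'        => "\xc3\xa1\xc3\xa9\xc3\xb3\xc3\xba", // áéóú as UTF-8
];

foreach ($samples as $label => $bytes) {
    printf("%-20s %s\n", $label, isValidUtf8($bytes) ? 'valid UTF-8' : 'not valid UTF-8');
}
```

Only the last sample should be reported as valid; the position of the offending byte makes no difference to this check.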

* I've opened a report for this at https://bugs.php.net/bug.php?id=72933

Álvaro González · Answer

áéóú in ISO-8859-1 encodes as:

e1 e9 f3 fa

If you mis-interpret it as UTF-8 you only get four invalid byte sequences. The Multi-Byte extension is basically designed to ignore errors. For instance, mb_convert_encoding() will replace those sequences with question marks or whatever you set with mb_substitute_character().
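To see those bytes and the substitution behaviour for yourself, something like the sketch below should do. The '*' substitute character is an arbitrary choice of mine, and the exact number of substitutions in the last step is deliberately not stated because it may vary between PHP versions.

```php
<?php
// áéóú as ISO-8859-1, written as explicit bytes so the example does not
// depend on the encoding this file is saved in.
$latin1 = "\xe1\xe9\xf3\xfa";
echo bin2hex($latin1), "\n"; // e1e9f3fa

// Naming the real source encoding yields proper UTF-8.
$utf8 = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
echo bin2hex($utf8), "\n"; // c3a1c3a9c3b3c3ba

// Mis-declaring the bytes as UTF-8: the invalid sequences are replaced by
// the substitute character instead of raising an error.
$previous = mb_substitute_character(); // remember the current setting
mb_substitute_character(0x2A);         // use '*' (arbitrary choice)
$mangled = mb_convert_encoding($latin1, 'UTF-8', 'UTF-8');
mb_substitute_character($previous);    // restore the previous setting

var_dump($mangled);
```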

My educated guess is that the strict parameter determines what should be done with invalid byte sequences:

false means to remove them
true means to keep them

If you ignore these invalid sequences you're obviously discarding extremely valuable information and you only get sensible results in very limited circumstances, e.g.

$str = chr(81);
var_dump( mb_detect_encoding($str, ['ISO-8859-1', 'Windows-1252']) );
var_dump( mb_detect_encoding($str, ['Windows-1252', 'ISO-8859-1']) );

To sum up, mb_detect_encoding() is in general not as useful as you may think, and it's total crap with the default parameters.


Tags: php, character-encoding

vaso123
